A training set (TS) of document records with assigned utilities and a
utility threshold defining document relevance are provided by a user. The
TS is processed to give Boolean combinations of index terms for searching
a document file. A linear utility prediction function (LUPF) is fitted to
the TS documents using selected index terms. The LUPF is thresholded and
the resulting pseudo-Boolean inequality is solved, giving term combinations
for retrieving relevant documents. Algorithms are presented and testing
is discussed.
ABSTRACT


DESIGN OF A DOCUMENT RETRIEVAL SYSTEM USING PATTERN RECOGNITION
AND MATHEMATICAL PROGRAMMING TECHNIQUES


Steven R. Borbash, Jr., Ph.D.

University of Pittsburgh, 1970

A pattern recognition (PR) model of the document retrieval
process is introduced. This model processes a training set (TS) of
documents to derive file searching instructions. A file of indexed
documents and a subsystem to implement search instructions are assumed
to be available. Documents are represented as binary vectors of index
terms. Two mutually exclusive categories of documents exist, A
(relevant) and B (non-relevant). Each document in the TS is assigned a
utility u on an arbitrary scale by a user. All documents in the TS
with u ≥ τ (a user specified threshold) are relevant.

The system 'learns' from the TS to predict document utility as
a linear function of the index terms and hence to recognize relevant
documents. The TS is processed by feature extraction followed by
estimation of parameters in the linear utility prediction function
(LUPF). Feature extraction discards all but those index terms judged
'best' using an information theoretic estimate. The LUPF parameters
are those which give a 'best' approximation (in the L1 norm sense) to
the utilities of the TS documents as a function of the extracted index
terms. This approximation problem is solved as a linear programming
problem.

After the LUPF has been estimated, relevant documents can be
identified by applying the LUPF and the threshold τ sequentially to
all document vectors in the file. This is a 'weighted term' search.
Equivalent Boolean search instructions (called a Boolean retrieval
strategy or BRS) can be derived by solving the linear pseudo-Boolean
inequality (LPBI) formed by the LUPF and the threshold. The solution
to this LPBI is a group of index term combinations (solution families).
All documents having index term combinations which match any one of
the solution families will be relevant. Each solution family may be
regarded as a 'matching template' for classifying pattern vectors.
Analytical derivation of the BRS shows the relation between
'weighted term' and 'Boolean' searches. Other methods of BRS
construction are subjective. An algorithm is given for solving the LPBI
which explores a binary tree using a branch and exclude technique.

The PR model was tested on the NASA document file using a
designed factorial experiment. Human analysts and the PR system both
produced BRS's from the same training sets. The effectiveness of
searches done with these BRS's was compared. The analysts' BRS's were
approximately twice as effective as those of the PR system, apparently
because analysts supplement their TS's with information not available
to the PR system. Suggestions for improvement are offered.
1.0 INTRODUCTION


1.1 Summary


1.11 Objective

This dissertation presents details of the design and testing
of a document retrieval system (DRS) using the NASA Scientific and
Technical Information System (1,2,3,4)*. The analytical model used for
the DRS treats the system as a pattern recognizer. The objective of
the system is to automatically develop a set of Boolean file searching
instructions from a sample of relevant and non-relevant documents.

A computerized file of document numbers and associated index
terms is assumed to be available. The system presented here receives
as input a sample set of documents from this file. Each of the
documents in the set has been assigned a personal utility by a user.
In addition to the sample set, the user has specified a utility
threshold τ, which defines two categories, relevant and not relevant.

The system output is a set of searching instructions for
retrieving all other documents from the file which are predicted to be
relevant, based on the examples provided in the sample set. The
searching instructions are presented as Boolean combinations of index
terms which are collectively known as a Boolean retrieval strategy
(BRS). The system is shown in the diagram that follows.

*Parenthetical references placed superior to the line of text
refer to the bibliography.
[Diagram: sample set from file -> system -> file searching instructions]


1.12 Motivation


A DRS which functions as described above provides a new method
for a user to interact with a computerized file. This method
eliminates some pressing practical problems. In addition, it provides a
new analytical framework for studying the retrieval process.

There are practical problems associated with the present method
of communication between the human user and the computerized file. The
NASA system currently accepts file searching instructions in the form
of a subjectively derived BRS submitted by a user. All documents which
match this subjective BRS are then retrieved for the user.

To form a BRS the user first selects a small subset of index
terms. Next the user specifies Boolean combinations of these terms
which he feels are meaningful. As an aid to index term selection and
combination, the user may consult a thesaurus and/or consider index
term usage statistics. The subjective determination of a BRS in this
manner is very difficult and fatiguing, and results are often
unsatisfactory. New methods are needed which help the user select and
combine terms.

The DRS presented here provides this type of aid to a user. A
training or example set of documents is presented to the system. The
DRS attempts to 'learn' how to discriminate between relevant and
non-relevant documents by using this set. Thus the DRS becomes an
intellectual tool of the user and acts as his 'agent' to derive a BRS.
This system allows the user to concentrate his efforts on making value
judgments of documents in the training set. It relieves him of the
combinatorial problems of BRS formation.

Analytically, the model used here allows pattern recognition
and mathematical programming techniques developed for pattern
recognition systems to be applied directly to the document retrieval
problem. In addition to supplying numerical techniques, the model
suggests many extensions for further study.

1.13 Relationship to the Work of Others 

The DRS model developed here fills an important gap in the
literature. This results from concentrating only on deriving the BRS
from the training set. Both automatic index term extraction and the
techniques of carrying out search requests have been excluded from
consideration here. A file of indexed documents is assumed to exist,
along with a system for carrying out search instructions.

In other DRS's, automatic index term extraction from full
English text has occupied a large portion of the analytical
effort. Still other researchers have been concerned mainly with
the file structure and/or the mechanics of carrying out search
requests. Generally a specified set of search instructions is
regarded as the input or query to their systems.



In the system here, the BRS is developed analytically from the
training set by first deriving a set of index term weights and then
developing the BRS from these. Others have used weighted term systems
to carry out file searches. The index term weights are quite often
assigned subjectively and occasionally by analytical
methods. The analytical method used here to derive term
weights is new, and depends upon user-assigned document utilities.

An important new result here is that the BRS is simply an
alternate way to express weighted term search instructions. Thus, given
any set of index term weights and a threshold, it is possible to
derive an equivalent BRS using algorithms presented here. Others have
attempted to specify index term weights which would simulate a given
subjective BRS. This is the inverse of the approach taken
here.

1.14 Methods Used 

A utility prediction function for documents is constructed
from the training set. This utility function is used, together with
the user-specified threshold τ, to retrieve documents from the file
which are predicted to be relevant.

In the context of pattern recognition systems, the thresholded
utility prediction function is a decision function. Each document in
the system is represented as a vector x of index terms which is then
assigned to one of two mutually exclusive categories, 'relevant' or
'non-relevant', by applying the decision function [f(x) - τ].
The training set is submitted by the user. Each document in
this set is assigned a utility on a pre-determined scale. Both
relevant and non-relevant documents are represented. Feature extraction
(dimensionality reduction) is first performed on training set vectors
to reduce their dimensions. A subset of index terms is selected using
an information theoretic measure. This measure gives an estimate of
how well individual index terms discriminate between relevant and
non-relevant documents in the training set.
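
The term-screening step just described can be illustrated with a small sketch. The summary does not give the exact measure used (it is developed in Chapter 4), so the code below assumes a standard stand-in: the reduction in category entropy obtained by observing one binary index term. The function names and the five-document training set are hypothetical and serve only as an illustration.

```python
import math

def entropy(p):
    """Entropy in bits of a two-category split with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def information_gain(column, relevant):
    """Reduction in category entropy obtained by observing one binary term."""
    n = len(column)
    before = entropy(sum(relevant) / n)
    after = 0.0
    for value in (0, 1):
        group = [r for x, r in zip(column, relevant) if x == value]
        if group:
            after += len(group) / n * entropy(sum(group) / len(group))
    return before - after

def select_terms(vectors, relevant, keep):
    """Return positions of the `keep` highest-scoring terms, plus all scores."""
    scores = [information_gain([v[i] for v in vectors], relevant)
              for i in range(len(vectors[0]))]
    # Ties are broken by term position so the result is deterministic.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:keep], scores

# Hypothetical training set: 5 documents, 4 index terms, first two relevant.
vectors = [
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
relevant = [1, 1, 0, 0, 0]
selected, scores = select_terms(vectors, relevant, keep=2)
```

In this toy data the term at position 0 never occurs and scores zero, while the term at position 1 separates the two categories perfectly and receives the full category entropy as its score.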

Next a linear decision function is estimated using the reduced
(in size) training set vectors. (For this application, f(x) is a
linear utility prediction function (LUPF).) Parameters in this linear
model are estimated from the training set using the L1 norm criterion
of best approximation. This estimation problem is set up as a linear
program and solved using the simplex algorithm.

Finally, by applying the LUPF to documents not in the training
set, it is possible to identify all the documents in the file which
are predicted to be relevant.

This identification can be done in two ways. By evaluating
f(x) for each x and comparing this to the threshold τ, each x
may be classified individually. This method is appropriate for
searching a sequentially structured file (SSF). An alternate method
is to solve the linear pseudo-Boolean inequality (LPBI), f(x) ≥ τ, for
its solution families. This gives Boolean combinations of index terms
which are the analytically derived BRS. The BRS form of the LUPF is
necessary for searching an inversely structured file (ISF).
The BRS derived above is a set of matching templates which can
be placed over a pattern vector x to categorize it. Each template
corresponds to a solution family of the LPBI. Solution families to
the LPBI are obtained using a branch-and-exclude binary tree search
algorithm. Fig. 1-1 shows a block diagram of the system.
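
To make the idea of solution families concrete, here is a minimal sketch of a binary tree search over a linear pseudo-Boolean inequality. It is not the Tree Pruning Algorithm of Chapter 6 nor the Hammer and Rudeanu procedure; it is a simplified illustration that assumes a plain bounding rule: a node becomes a template when even the worst completion satisfies the inequality, and is excluded when even the best completion fails. The weights, the threshold and all function names are hypothetical.

```python
from itertools import product

def solve_lpbi(weights, tau):
    """Return templates (dicts: position -> 0/1) such that every binary
    vector matching a template satisfies sum(w_j * x_j) >= tau."""
    families = []

    def visit(j, fixed, partial_sum):
        free = weights[j:]
        lowest = partial_sum + sum(w for w in free if w < 0)
        highest = partial_sum + sum(w for w in free if w > 0)
        if lowest >= tau:        # every completion is a solution: keep template
            families.append(dict(fixed))
            return
        if highest < tau:        # no completion is a solution: exclude branch
            return
        fixed[j] = 1             # branch on x_j = 1, then x_j = 0
        visit(j + 1, fixed, partial_sum + weights[j])
        fixed[j] = 0
        visit(j + 1, fixed, partial_sum)
        del fixed[j]

    visit(0, {}, 0.0)
    return families

def matches(template, x):
    """True if vector x agrees with the template on every fixed position."""
    return all(x[i] == v for i, v in template.items())

# Hypothetical term weights and threshold.
weights = [3, 2, -1, 1]
tau = 3
families = solve_lpbi(weights, tau)

# The 'weighted term' test and the 'Boolean' template test should agree
# on every possible binary vector.
agree = all(
    (sum(w * xi for w, xi in zip(weights, x)) >= tau)
    == any(matches(t, x) for t in families)
    for x in product((0, 1), repeat=len(weights))
)
```

Each returned template fixes some terms to 1 (term required) or 0 (term excluded) and leaves the rest unconstrained, which is the form of a conjunction in a Boolean retrieval strategy.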

1.15 Testing and Results 

Training sets were prepared for several test questions. Using
these training sets, BRS's were written both automatically by the
system and by a group of experienced NASA system users. A portion of the
NASA file was searched using each of the BRS's.

Relevant documents had been identified beforehand and a
measure of effectiveness was developed for each search which used this
fact.

Test results showed that the machine-derived BRS's were only
about half as effective as the subjective user-derived BRS's.
Differences appear to be largely attributable to the use (by humans) of
supplementary information not contained in the training set.

1.16 Conclusions 

It is concluded that the pattern recognition model of document
retrieval employed here is very useful for deriving an analytical BRS.
However, more work is needed to increase the practical effectiveness of
the automatic system, particularly in the area of feature extraction.
1.2 Structure and Assumptions of the Model

The analytical assumptions made to model the process are 
listed and discussed below. 

1.21 Document Representation and File Structure


A file of indexed documents is assumed to exist. Each
document d_k, k=1,2,...,D in the file is represented as a binary vector
x_k = (x_ik), i=1,2,...,Ω, of index terms, chosen from a master list
having Ω terms. If index term i is assigned as a characteristic
to document d_k, then x_ik = 1. Otherwise x_ik = 0. For example, in
the NASA system, Ω = 13,000; D = 500,000 and about eleven x_ik = 1
for each k.

The entire file may be conveniently pictured as a binary
document-term matrix X having Ω rows and D columns. Each row index i
corresponds to an index term T_i, where all terms are arranged in some
standard order (such as alphabetically) and each column index k
corresponds to a document number n_k, where all document numbers are also
arranged in some standard order (such as chronologically). Because
the matrix is very sparse, it is convenient to represent it in a more
compact form. There are two ways to readily do this, by collapsing
either the matrix columns or rows.

To collapse the matrix columns, represent each column
(document) vector x_k as a list of row indices having P_k members.
Here the members are the row indices i corresponding to x_ik = 1.
The list simply identifies the index terms used with a
given document. For example, with the NASA system there would be about
500,000 lists having an average of 11 members each. A data structure
can now be defined having a master list of document numbers n_k,
k=1,2,...,D, where each n_k has an associated sub-list of index
term numbers. This data structure will be defined as a sequentially
structured file (SSF).

Alternately, it is possible to collapse the matrix rows. Each
row can be represented as a list of column indices having A_i
members, C_i = (c_i1,...,c_iA_i), where the c_ij are column indices
corresponding to x_ik = 1. This list identifies the documents associated
with the index term T_i. The corresponding data structure has a
master list of index terms, with each term having an associated
sub-list of document numbers. This data structure is defined as an
inversely structured file (ISF).

Observe that to locate in an SSF all d_k with f(x_k) ≥ τ
it is necessary to examine every list, form x_k from this list
and then make a decision.

With an ISF, searching is done only with specified Boolean
combinations of index terms (the BRS). Appropriate set operations on
the lists C_i associated with the terms T_i will give a resultant
set of document numbers. Since usually only a small subset of all T_i
are specified in the BRS, the search of an ISF is more economical than
the search of an SSF. The conversion of the condition f(x_k) ≥ τ to
an equivalent BRS allows the more economical ISF search to be
substituted for the SSF search. Given a file, it is easy to convert it
from an ISF to an SSF or vice versa. We will represent a file of index
terms and document numbers in either form as F(X, T_i, n_k).
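
The two file organizations and the conversion between them can be sketched as follows. This is a minimal illustration using Python dictionaries and sets, not the file layout of the NASA system. The document numbers, term names and the representation of a BRS as a list of (required terms, excluded terms) templates joined by OR are all assumptions made for the example, and each template is assumed to name at least one required term.

```python
def invert(ssf):
    """Convert an SSF (document number -> set of terms) into an
    ISF (term -> set of document numbers)."""
    isf = {}
    for doc, terms in ssf.items():
        for term in terms:
            isf.setdefault(term, set()).add(doc)
    return isf

def search_ssf(ssf, brs):
    """Sequential search: every document's term list must be examined."""
    return {doc for doc, terms in ssf.items()
            if any(required <= terms and not (excluded & terms)
                   for required, excluded in brs)}

def search_isf(isf, brs):
    """Inverted search: only the lists of terms named in the BRS are used."""
    hits = set()
    for required, excluded in brs:
        if not required:
            raise ValueError("each template needs at least one required term")
        docs = set.intersection(*(isf.get(t, set()) for t in required))
        for t in excluded:
            docs -= isf.get(t, set())
        hits |= docs
    return hits

# Hypothetical four-document file.
ssf = {
    101: {"LASER", "OPTICS"},
    102: {"LASER", "PLASMA"},
    103: {"OPTICS"},
    104: {"LASER", "OPTICS", "PLASMA"},
}
isf = invert(ssf)

# (LASER AND OPTICS AND NOT PLASMA) OR (PLASMA AND NOT OPTICS)
brs = [({"LASER", "OPTICS"}, {"PLASMA"}), ({"PLASMA"}, {"OPTICS"})]
result_isf = search_isf(isf, brs)
result_ssf = search_ssf(ssf, brs)
```

Both searches return the same document numbers; the inverted search touches only the three term lists named in the BRS, whereas the sequential search visits every document.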

1.22 Fundamental Assumptions 


1.221 File Existence. A file F(X, T_i, n_k) of indexed documents
d_k, k=1,...,D exists. The n_k, k=1,2,...,D are document numbers,
while the T_i, i=1,2,...,Ω are index terms.

1.222 Document Utility. Each document d_k represented in the file
has a personal utility u_k to a given user at a given time. The
utilities u_k can be measured on an arbitrary scale.

1.223 Document Relevance. A threshold τ (dependent on the chosen
utility scale) can be specified by a user to define relevant and
non-relevant documents (u_k ≥ τ ⇒ d_k is relevant).

1.224 System Objective. The objective of the system is to provide a
list from the file F of document numbers n_k corresponding to all
relevant d_k.

1.225 Source of Information for Utility Prediction. The utility u_k
of any document d_k may be adequately predicted as some function of
x_k, where x_k is the column vector of X associated with document
d_k, i.e., u_k = f(x_k). This assumption disallows the use of information
which is not associated with the document characteristics in the file.

1.226 Dimensionality Reduction. For the purposes of any given user,
all but a small subset of all index terms may be neglected without a
significant loss of information. This allows the vectors x_k to be
reduced in dimension.

1.227 Linear Utility Function. A prediction û_k of document utility
is adequately given by

$$\hat{u}_k = f(\mathbf{x}_k) = \sum_{j=0}^{n} x_{jk}\,\beta_j$$


1.228 Estimation of Parameters in the Linear Utility Function. The
parameters β_j, j=0,1,...,n in the linear utility function may be
adequately estimated from examples in a training set of m documents
where m > n.


1.3 Limitations 


Assumptions 1.221 through 1.225 are rather general.
Assumption 1.225 implies that the quality of indexing is adequate for the
group of users who will retrieve from the file.


Assumption 1.226 is quite restrictive since it assumes that
all but a small set of index terms may be discarded without
significantly degrading system performance. This of course is always done
by users who form a BRS with only a few (from 3 to 15) index terms
selected subjectively from the master list. This same assumption is
also made frequently in pattern recognition systems design, where it
is termed 'pre-processing' or 'feature extraction'. It is also
numerically necessary to reduce the size of the vectors x_k before
continuing with the estimation problem of assumptions 1.227 and 1.228.

Assumption 1.227 assumes a linear utility function for
convenience in estimating the parameters. This is a fairly
restrictive assumption.

Assumption 1.228 implies that the sample adequately represents
the user's interests over the entire file. The number of documents m in
the training set must be greater than or equal to the number of
parameters β_j which are estimated. This relates to assumption 1.226,
since the final reduced dimension of the training set vectors fixes the
maximum number of parameters which may be estimated.


1.4 Organization of this Dissertation 

This dissertation is presented in nine chapters, which
describe system design and tests performed on the NASA file.

Chapter 2 describes a simple pattern recognition system, but 
not in the context of document retrieval. An example problem illus- 
trates system operation. Example patterns are classified using both 
the linear decision function and the matching templates which are de- 
rived from it, by solving a pseudo-Boolean inequality. 

Chapter 3 relates the system of chapter 2 to a similar system
for the document retrieval problem. Document utility is defined and
measured on an arbitrary scale. A user specified threshold is
introduced on this utility scale to define relevance. The decision function
can now be interpreted as a utility prediction function. The matching
templates for classifying patterns are shown to be identical in form
and use to the subjective BRS.

Chapter 4 develops the information theoretic measure for
extracting best index terms as an extension of decision theory when
utilities for action-outcome pairs are not known. This information
theoretic measure has been used in other recognition systems for
extracting pattern features. See, for example, Lewis and Maltz.
The interpretation here is different and follows Watanabe (17) more
closely.

Chapter 5 illustrates the determination of index term weights
by using approximation theory. The L1 norm problem is formulated
as a linear programming problem (see Barrodale). Examples are
given illustrating alternate optimal solutions. Special properties
of the solution are noted.

Chapter 6 presents the theory of pseudo-Boolean inequalities
as developed by Hammer and Rudeanu (20,21,22). A composite algorithm
is presented here which solves a pseudo-Boolean inequality by a
branch-and-exclude technique carried out in the context of a binary
tree search. The basic branch-and-exclude technique is that developed
by Hammer and Rudeanu. To implement this technique, a binary tree
traversal subalgorithm (23) is introduced which controls and sequences
the tree search. The composite algorithm is called the Tree Pruning
Algorithm (TPA) [1]. An example problem is solved and computational
experience with the TPA is discussed.

[1] This name was used by E. W. Kozdrowicki (27) to describe a
general process of branching and excluding in operations with tree
structures. Because of the accurate description which it also conveys about
the operations of solving a pseudo-Boolean inequality, it is used again
here.

Chapter 7 describes system testing which is carried out by
using a 2^3 factorial design. The main factor tested was the
difference in the effectiveness of searches performed using BRS's
subjectively derived by analysts and BRS's analytically derived by the
methods of chapter 3. Three measures of effectiveness were used to
evaluate search effectiveness. The more traditional measures of recall
and precision (24,25) were both used. In addition an information
theoretic measure suggested by Meetham (26) was used. Other factors
tested were those of training set size and the number of extracted
features.
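
For orientation, recall and precision can be computed as in the short sketch below. It assumes the conventional definitions (recall is the fraction of relevant documents that were retrieved, precision is the fraction of retrieved documents that are relevant); the exact forms used in Chapter 7 follow the cited references and may differ in detail. The function name and the document numbers are hypothetical.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical search: 4 documents retrieved, 5 known relevant, 2 in common.
recall, precision = recall_precision({1, 2, 3, 4}, {2, 4, 6, 8, 10})
```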

Chapter 8 discusses results of the testing, and presents
ancillary data felt to be of interest. Searches done using subjective
BRS's were significantly more effective than those performed using the
analytically derived BRS's. The difference is largely attributable to
a significant difference in precision of subjective and machine
searches. This difference in precision seems related to the human use
of information not contained in the training set. The extra
information allows human analysts to avoid using index terms which have a high
frequency of occurrence, even though they are excellent discriminators
over the training set.

Chapter 9 suggests improvements and extensions of some of the 


concepts which appear useful. The generality of the pattern recogni- 
tion model is apparent from the number of possible extensions. 
Appendix A provides an example of the processing of a typical
document training set to produce a BRS. Programs were written in
Fortran IV for the IBM 7094/7044 Direct Couple System.

It is concluded that the pattern recognition model presents
a very convenient analytical framework to use for document retrieval
system analysis and design. Resolution of significant differences
between automatic systems and human beings appears to be within the
realm of possibility if more sophisticated automatic systems are
designed.


1.5 Contributions

The contributions of this dissertation are felt to be in three
areas: models, methods, and data.

1.51 Models

Modeling the derivation of the BRS as a pattern recognition
problem is felt to be significant because it allows rigorous analytical
methods developed by others (information theory, approximation theory,
linear programming) to be applied directly to the document retrieval
problem. This is an application of existing technology to a new area.

The conversion of a linear decision function to equivalent 
matching templates by solving an associated LPBI is a new application 
of pseudo-Boolean programming to pattern recognition systems. 

The analogy between the BRS of document retrieval systems and
the matching templates of pattern recognition systems makes this new
template-generation technique immediately applicable to document
retrieval systems utilizing inversely structured files (ISF's).

1.52 Methods 

Generation of matching templates by solving an LPBI for its
solution families is made practical by development of an algorithm
to carry out the required computations quickly and efficiently. No
claim is made here to the general method of LPBI solution via
branch-and-exclude operations in a binary tree. This is due to Hammer and
Rudeanu. The contribution here is the adaptation of a sub-algorithm
to efficiently organize and sequence the branch-and-exclude operations.

1.53 Data

Testing of the model and methods on the NASA document retrieval
system has given new data on which to plan future system modifications
and retrieval experiments.

In addition, a limited amount of data is also available on
operation of the TPA (tree pruning algorithm) for solution of the LPBI.
This data should provide a basis for comparison of the present TPA
with future modified versions as they are developed.
2.0 PATTERN RECOGNITION SYSTEMS

This chapter introduces and briefly describes a pattern
recognition system of the type which will be applied to the document
retrieval problem.

The general concepts of feature extraction, decision function
formation and template matching operations are introduced and
discussed. One simple example is used throughout the chapter to
illustrate these concepts.


2.1 Introduction


Pattern recognition systems are concerned with the automatic
classification of patterns (represented as vectors) into two or more
mutually exclusive categories. A training set of pre-classified
patterns is assumed available to 'train' the recognition system. After
'training', patterns of unknown classification are presented to the
recognition system. If the training set was 'typical' in some sense,
then the recognition system should classify the unknown patterns
'reasonably well'.

The simplest pattern recognition system is one which works
with binary pattern vectors x = (x_1, x_2, ..., x_n) where x_i ∈ {0,1}, and
classifies all patterns into one of two categories. This is the type
of system to be considered here. For general references to the subject
of pattern recognition, see for instance Nilsson (28), Nagy (29) or
Ho (30).
2.2 Pre-Processing

Training of a recognition system can be considered in two
parts. The first part concerns representation of pictures or other
patterns as vectors, and will be called pre-processing (31). The
second part estimates parameters of a decision function from vectors in
the training set.

2.21 Representation of the Pattern as a Vector 

Figure 2-1A illustrates a group of 5 simple patterns. A
recognition system is desired which will distinguish between binary
patterns representing pictures of the letters A and B. Let these
patterns become the training set, which contains two 'pictures' of the
letter A and three of the letter B. The grids of the pictures shown
are 4 x 4. If we agree to order the rectangular sub-elements of the
pictures from left to right and from top to bottom, then we can
represent each picture of Fig. 2-1A as a binary vector x_k as shown in
Fig. 2-1B, where x_ik = 1 if any element of the k-th 'picture of A' or B
lies within the i-th rectangle and x_ik = 0 otherwise.
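
The left-to-right, top-to-bottom ordering can be written out as a short sketch. The 4 x 4 grid below is a hypothetical pattern chosen for illustration and is not one of the pictures of Fig. 2-1A; the function name is likewise an assumption.

```python
def grid_to_vector(grid):
    """Flatten a grid of 0/1 cells row by row (left to right, top to bottom)."""
    return [cell for row in grid for cell in row]

# Hypothetical 4 x 4 binary picture.
grid = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
x = grid_to_vector(grid)

# The text numbers vector elements from 1, so element i is x[i - 1].
element_8 = x[8 - 1]
```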

2.22 Feature Extraction 

The next step in designing an automatic recognition system is
usually to reduce the dimension of the pattern vectors by discarding
vector elements which are 'non-informative'. This operation is also
known as 'feature extraction'. It is a very important portion of the
pre-processing operation. Heuristically we can see that vector
elements 1, 4 and 13-16 contain no information at all, since they are
always zero, regardless of whether the pattern is an A or a B. Vector
element 2 is a perfect classifier of the patterns in the training set,
since x_2k = 1 when k=1,2 (letter A) and x_2k = 0 for k=3,4,5
(letter B). Vector elements 3, 5 and 6-12 give some information about
the correct classification of the vectors even though they are not
perfect predictors.

[Figure 2-1: Example showing patterns represented as vectors and
illustrating feature extraction]

The notion of information content over the training set can be
formalized by using the concept of entropy from information theory.
This will be done later. Assume for illustrative purposes that all
vector elements except 3, 5 and 8 have been discarded. Then elements
3, 5 and 8 represent 'features' which have been extracted by the
information screening process. The resulting 5 three-dimensional feature
vectors are shown in Fig. 2-1C. Note that [...] and z [...]


2.3 Decision Function Specification 

The second major step in the machine training process is to
specify a decision function. This function is given as y = f(z). It
maps the feature vectors z of patterns of unknown classification into
the dependent variable y on the real line.

The form of the function f(z) is specified while the
parameters of f(z) are estimated from the training set.

The decision function f(z) is used as follows. Assume a
pattern vector x of unknown classification is to be put into category
A or B. First, vector x is reduced to vector z by extracting
the features selected as being 'informative' over the training set.
Then f(z) = y is computed and if y ≥ τ (a given threshold), then
the vector z (or x) is assigned to category A. If y < τ, then z
is assigned to category B.

2.31 Selecting the Form of Decision Function 

There are two methods generally used to select the form of
f(z). If the vectors z are from a known multivariate probability
distribution p(z), then the form of f(z) may be derived from the
form of this distribution. The parameters of p(z) which appear in
f(z) will be estimated from the training set. This is known as
parametric decision function formation.

The other method used to specify f(z) is known as
nonparametric decision function formation. Here the form of f(z) is chosen
as a matter of convenience, and the parameters are estimated from the
training set samples. Nonparametric methods are used exclusively for
the applications to be considered here.

A very convenient form for the decision function and the one to
be considered here is the linear function

$$y = f(\mathbf{z}) = \sum_{j=0}^{n} \beta_j z_j, \qquad -\infty < \beta_j < \infty, \quad z_j \in \{0,1\}$$

The β_j are the feature weights, while the z_j are binary elements of
the feature vector z.
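
The classification rule of section 2.3 combined with this linear form can be sketched as follows. The feature positions 3, 5 and 8 come from the illustrative example of section 2.22; the weights, the threshold and the 16-element pattern are hypothetical, and for simplicity the sketch uses one weight per extracted feature with no constant term.

```python
def extract_features(x, positions):
    """Keep only the listed vector elements (numbered from 1, as in the text)."""
    return [x[i - 1] for i in positions]

def decide(z, weights, threshold):
    """Return (category, y): category A if y >= threshold, otherwise B."""
    y = sum(b * zj for b, zj in zip(weights, z))
    return ("A" if y >= threshold else "B"), y

# Hypothetical 16-element pattern vector, weights and threshold.
x = [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
positions = (3, 5, 8)
weights = [2.0, -1.0, 1.5]
threshold = 1.0

z = extract_features(x, positions)
category, y = decide(z, weights, threshold)
blank_category, blank_y = decide(extract_features([0] * 16, positions),
                                 weights, threshold)
```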


2.32 Estimating the Parameters of the Decision Function

The parameters β_j are estimated by an approximation process
from the samples in the training set. If the training set is large
and typical of the universe of unknown patterns to be classified, then
good results should be expected when y = f(z) is used to classify
unknown patterns.

2.321 The Associated Approximation Problem. There is considerable
freedom in choosing a method of estimating the β_j. Nearly all
methods involve the choice of a best approximation to the β_j based
on the training set. This type of problem has been studied extensively
by mathematicians, to whom it is known as the discrete linear
approximation problem (32,33). Consider the following relationships for a
training set of n pattern vectors having m < n elements each:

$$y_i = \sum_{j=0}^{m} \beta_j z_{ij} + r_i, \qquad i = 1, 2, \ldots, n$$

or

$$\mathbf{y} = Z\boldsymbol{\beta} + \mathbf{r}$$


Here x. is an (nxl) vector of known binarj’’ variables obtained from the 
training set; y. = +1 .if pattern i belongs to category A and 
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~ “1 othervise. 0. = (6^) is an (m x l) vector of para- 

meters (feature weights). Z = (z..) is a kno\<fn matrix (n x m) of 
"binary variables obtained from the training set. Rows of Z are the 
training set feature vectors The unkhovm vector of residuals ' 

(n X 1) is ^denoted by r = (r . 

The problem is to estimate g_. Call this estimate (note 

that ^ can. never be known exactly as long as the training set is 
only a sample of the universe of all- patterns . Note that ;y = Zb 
is an estimate of based on the estimate of Then ~ 

Zb = jr - i. 

2.322 Choosing the Criterion of Best Approximation. By a best
estimate b of β we shall mean the vector b which minimizes the length
(norm) ||r|| of the vector r. There are many ways of specifying a
norm. An entire class of norms is given by the L_p (L sub p) norms
defined below:

    L_p(r) = \left( \sum_{i=1}^{n} |r_i|^p \right)^{1/p}.

When p = 2,

    L_2(r) = \left( \sum_{i=1}^{n} r_i^2 \right)^{1/2},

and we get the familiar least squares problem. When p = 1, we have
L_1(r) = \sum_{i=1}^{n} |r_i|. In the limit as p → ∞ we have
L_∞(r) = max_i |r_i|. This is also known as the Chebyshev, uniform or
max norm. The approximation problem can now be written as follows.
Find b such that

    \| y - Zb \| = \min_{a} \| y - Za \|.
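As a rough illustration of the three norms named above, the following
Python sketch evaluates L_1, L_2 and L_∞ for a small residual vector.
The function name lp_norm and the residual values are assumptions made
for this illustration only; they are not taken from the text.

```python
import math

def lp_norm(r, p):
    """Return the L_p norm of residual vector r; p may be math.inf."""
    if p == math.inf:
        # Limit p -> infinity: the Chebyshev (max) norm.
        return max(abs(ri) for ri in r)
    return sum(abs(ri) ** p for ri in r) ** (1.0 / p)

# Hypothetical residual vector with two unit errors.
r = [1, 0, 0, -1, 0]

l1 = lp_norm(r, 1)           # sum of absolute residuals
l2 = lp_norm(r, 2)           # least squares length
linf = lp_norm(r, math.inf)  # largest absolute residual
print(l1, l2, linf)
```

The same residual vector can rank differently under different norms,
which is why the choice of norm changes which estimate b counts as best.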

All practical applications of discrete approximation theory known
to the author use either the L_1, L_2, or L_∞ norms (or some variation
of them), since these three formulations have solution algorithms
which are reasonable to implement on a computer. Most applications
utilize the L_2 norm. The solution for b is given then by the familiar
least squares normal equations

    b = (Z'Z)^{-1} Z'y.

Both the L_1 and L_∞ norm problems can be cast as linear programming
(LP) problems, which are readily solvable by the simplex algorithm or
one of its variations (37,38,39).

The popularity of the L_2 norm is due largely to the following
items:

(a) familiarity of the method, and of the solution algorithms;

(b) statistical applications of least squares estimators when
the r_i are normally distributed; and

(c) uniqueness of the solution vector b.

Least squares estimation has the disadvantage that the n x m
matrix Z must have all m columns linearly independent to insure
that (Z'Z) will be nonsingular.

The L_1 and L_∞ estimators of β have the following characteristics:

(a) ease of solution when formulated as LP problems;

(b) the n x m matrix Z is not required to have all m columns
linearly independent to guarantee a solution;

(c) the L_1 and L_∞ estimators can be better estimators of β
than L_2 when the r_i are not normally distributed (41);

(d) L_1 and L_∞ estimators are not necessarily unique. The same
minimal value of L(r) can be attained for more than one solution
vector b.

The overall differences in estimates of β based on L_1, L_2 and
L_∞ norms can be negligible. Choice of a norm for applied problems
often depends upon practical considerations.


In the application to document retrieval systems to be presented
in chapter 3, the L_1 formulation will be utilized for the following
two reasons:

(a) the columns of Z cannot be guaranteed independent, so that
further checking would be required if the L_2 norm were used;

(b) the L_1 problem is very rapidly and efficiently solved in the
linear programming (LP) formulation.


2.33 Current Methods of Forming Decision Functions

A great number of pattern recognition decision functions are
linear. Several techniques for estimating the parameters are based
on methods which are variations of the L_1 or L_∞ norm. See for
example Smith or Grinold. Least squares methods are also used. See
for instance Y.C. Ho (45). For another formulation less recognizable
as an approximation problem, see Mangasarian or Taylor.


2.34 An Example Problem Illustrating Decision Function Determination

In the example used to illustrate feature extraction, features
3, 5 and 8 were arbitrarily chosen, and the feature vectors
z_1, ..., z_5 were formed. These vectors now represent the training
set, instead of the vectors x_1, ..., x_5.

Figure 2-2A shows the model y = Zβ + r for this example. The least
squares criterion is used to derive a solution b as shown in
Fig. 2-2B. The least squares solution is used for this example problem
only. All subsequent problems will use the L_1 norm criterion. The
residual vector for the least squares solution is shown in Fig. 2-2C.

In Fig. 2-2A the n = 5 rows of the matrix Z are the n feature
vectors z_i which constitute the training set for the problem. Each
vector z_i is augmented by adding unity in the first position.

The columns of the matrix Z (excluding the first column)
correspond to the 0/1 'features' which were extracted from the
original training set vectors x. The first column is a vector of all
1's which is included to allow a constant term in the decision
function.

FIGURE 2-2

EXAMPLE SHOWING LINEAR DECISION FUNCTION PARAMETER ESTIMATION

A. Linear Model Estimation

    y = Z\beta + r

B. Least Squares (minimal L_2(r)) Solution for the Estimator of β

    b = \arg\min_{\beta} \| y - Z\beta \|_2 = (Z'Z)^{-1} Z'y = (-1, 1, 2, 0)'

    \hat{y} = -1 + 1 z_1 + 2 z_2 + 0 z_3,   \tau = 0

C. Residual Vector for Least Squares Estimator

    \| r \| = \sqrt{1 + 1} = \sqrt{2}

The vector y of dependent variables consists of the elements
y_i = +1 or -1, where y_i = +1 is used for patterns of letter A in
the training set and y_i = -1 is used for patterns of letter B.

The problem is specified completely when τ is chosen. The
threshold τ is used for making decisions after β is estimated. This
threshold is somewhat arbitrarily specified as the midpoint 0 between
y_i = +1 and y_i = -1. If ŷ ≥ τ = 0 for some unclassified pattern,
then we agree to decide that this pattern z represents the letter A,
and if ŷ < τ, then z represents B.

Fig. 2-2B shows the least squares solution b = (-1, 1, 2, 0).
Here the feature z_3 has been assigned a weight zero (b_3 = 0).

The results of applying the model to the training set as a
predictor are given in Fig. 2-2C, which compares y and ŷ. Here the two
A patterns are correctly classified, but one of the B patterns
(pattern 4) is misclassified or rejected since ŷ_4 = 0. The linear
relationship ŷ = Zb is thus not completely adequate to correctly
classify all the documents in the training set.
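The estimation and classification steps of this example can be
sketched in Python as follows. The matrix Z and vector y below are
hypothetical: they are not the data of Fig. 2-1A, but were chosen for
illustration so that the normal equations give the weights and
residuals reported in Fig. 2-2. The helper names solve and
least_squares are likewise assumptions of this sketch.

```python
from fractions import Fraction

def solve(A, c):
    """Solve the square system A b = c by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(ci)] for row, ci in zip(A, c)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for i in range(n):
            if i != col and M[i][col] != 0:
                f = M[i][col]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[col])]
    return [row[n] for row in M]

def least_squares(Z, y):
    """Return b solving the normal equations (Z'Z) b = Z'y."""
    m = len(Z[0])
    ZtZ = [[sum(row[i] * row[j] for row in Z) for j in range(m)]
           for i in range(m)]
    Zty = [sum(row[i] * yi for row, yi in zip(Z, y)) for i in range(m)]
    return solve(ZtZ, Zty)

# Hypothetical training set: first column is the constant 1,
# the remaining columns are the binary features z_1, z_2, z_3.
Z = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [1, 0, 0, 1],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
y = [1, 1, -1, -1, -1]   # +1 for category A, -1 for category B

b = least_squares(Z, y)
y_hat = [sum(bj * zj for bj, zj in zip(b, row)) for row in Z]
r = [yi - yh for yi, yh in zip(y, y_hat)]
tau = 0
predicted = ["A" if yh >= tau else "B" for yh in y_hat]
print(b, r, predicted)
```

With these assumed data the fourth pattern receives ŷ = 0 and is
assigned to category A although it was labeled B, which mirrors the
misclassification discussed above.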

There is information lost at two points. First, the feature
extraction process throws away information by discarding potentially
important features. Secondly, the approximate linear decision function
may introduce errors. Perhaps a better decision function would be
nonlinear. Or perhaps the training set should be larger.

The fact that any pattern recognition system will make errors
must be accepted, although it must be trained to have a minimal (often
zero) error for the sample patterns. The emphasis is on picking a
reasonable system design and then adjusting it so that its recognition
error rate is acceptable for the application at hand.

2.35 Relationship of the Decision Function to Curve-Fitting Problems 


The standard curve-fitting or regression model is given by

    y = Z\beta + r

and is identical to the decision function model. The difference is
entirely in interpretation. In ordinary function fitting applications
the dependent variables y_i are the yield of some process. In the
pattern recognition problem, the y_i are fixed at ±1, to indicate two
different categories.

One way of resolving the apparent difference between the two
is to regard the y_i as the differences between two probabilities

    y_i = p(A/z_i) - p(B/z_i).

Then since p(A/z) = 1 and p(B/z) = 0 or vice versa for all training
patterns in categories A or B, it follows that

    (p(A/z) - p(B/z)) \in \{-1, +1\}.

If we agree to assign patterns to category A when ŷ ≥ τ = 0,
we see that

    \hat{y} = p(A/z) - p(B/z) \ge \tau = 0
    \Rightarrow  p(A/z) \ge p(B/z)
    \Rightarrow  p(A/z) / p(B/z) \ge 1.

Thus by assigning patterns to category A when ŷ ≥ τ = 0 we are making
a reasonable decision based on estimated probabilities. This
explanation of the decision function can be called the "potential
function" interpretation.

The independent variables z_ij in the problem are binary. In
the statistical literature linear least squares models of this type
are referred to as "experimental design models" (51).


2.4 Template Matching Operations 


2.41 Introduction 


Once the decision function is determined, the category of any
unclassified pattern x may be estimated by first converting x to z,
then by forming

    \hat{y} = b_0 + \sum_{j=1}^{n} b_j z_j

and comparing this with the threshold zero.

There is an alternative to computing ŷ and comparing it to a
threshold. This is the formation of groups of one or more templates
which compare specified combinations of binary features in the
original pattern vectors x, or feature vectors z.



2.42 The Pseudo-Boolean Inequality

The mathematical motivation behind this comes from the theory
of pseudo-Boolean inequalities (binary variables and real
coefficients). Note that:

    \hat{y} \ge \tau
    \Rightarrow  b_0 + \sum_{j=1}^{n} b_j z_j \ge \tau
    \Rightarrow  \sum_{j=1}^{n} b_j z_j \ge \tau - b_0,

which is a pseudo-Boolean inequality.

Binary vectors z are mapped onto the real line via the real
coefficients b_j. All binary vectors which satisfy the inequality
are solution vectors. Each solution vector represents a binary pattern
vector z which belongs to category A. The solution vectors z can
be grouped and placed into one or more solution families. Each
solution vector belongs to one and only one family.

The families specify a fixed configuration of either 0 or 1
for some of the variables in the vector, and a free configuration for
others.

To illustrate how solution vectors may be grouped into families,
consider some hypothetical inequality with six solution vectors
z = (z_1, z_2, z_3, z_4) and two solution families.

    F_1(z) = (1, 0, -, 0):   (1, 0, 1, 0)
                             (1, 0, 0, 0)

    F_2(z) = (-, 1, 1, -):   (1, 1, 1, 0)
                             (1, 1, 1, 1)
                             (0, 1, 1, 0)
                             (0, 1, 1, 1)

All 6 solution vectors lie in either family F_1(z) or family
F_2(z). F_1(z) is a compact representation of 2 solution vectors while
F_2(z) represents 4 solution vectors. Another way of writing the
families is F_1(z) = z_1 \bar{z}_2 \bar{z}_4 and F_2(z) = z_2 z_3.

Families of solutions may be regarded as matching templates
for the patterns z = (z_1, ..., z_4). For example, F_2(z) requires
the simultaneous presence of a 1 in components 2 and 3 of the
vector z. All vectors z with a 1 in both components 2 and 3 will
match the template F_2(z). Similarly all vectors with a 1 in
position 1 and 0's in both positions 2 and 4 will match the
template F_1(z).
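The two families of this illustration can be written down directly as
templates with fixed and free positions. The Python sketch below is
only an illustration; the names FREE, matches, expand and matches_any
are assumptions of the sketch and do not appear in the text.

```python
from itertools import product

FREE = None  # marks a free position, written "-" in the text

def matches(template, z):
    """True when z agrees with every fixed position of the template."""
    return all(t is FREE or t == zi for t, zi in zip(template, z))

def expand(template):
    """List every binary vector covered by a template."""
    choices = [(0, 1) if t is FREE else (t,) for t in template]
    return list(product(*choices))

F1 = (1, 0, FREE, 0)
F2 = (FREE, 1, 1, FREE)

# The six solution vectors listed in the text.
solutions = expand(F1) + expand(F2)

def matches_any(z, templates=(F1, F2)):
    """Return 1 if z matches at least one template, otherwise 0."""
    return int(any(matches(t, z) for t in templates))

print(solutions)
print(matches_any((1, 0, 1, 0)), matches_any((0, 0, 1, 1)))
```

Expanding F1 gives 2 vectors and expanding F2 gives 4, and no vector
is produced twice, in line with the statement that each solution vector
belongs to one and only one family.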

In this example all solution vectors belong to either family
F_1(z) or F_2(z). Also, all solution vectors z satisfy the
thresholded linear decision function given by

    \sum_{j=1}^{n} b_j z_j \ge \tau - b_0.

It follows that all pattern vectors z which match either template
F_1(z) or F_2(z) belong to category A (ŷ ≥ τ) and all patterns which
fail to match either template belong to category B (ŷ < τ).



It is convenient to define the characteristic function φ(z)
of a pseudo-Boolean inequality as a matching operation on the union
of all solution families (templates) F_k(z), k = 1, 2, ..., M:

    \phi(z) = \bigcup_{k=1}^{M} F_k(z).

φ(z) is a Boolean function which takes on the value of 1 when the
pattern vector z matches one of the M templates and takes on the
value 0 when a match does not occur.

It follows that

    \phi(z) = 1  \Rightarrow  z belongs to category A,
    \phi(z) = 0  \Rightarrow  z belongs to category B.

Observe that the solution of the pseudo-Boolean inequality
derived from the thresholded decision function involves no
approximation process. No information is lost. The matching templates
for making binary decisions about the classification of patterns
are merely an alternate form of implementing the decision function.
Instead of adding weights for vector elements which are present and
comparing the sum to a threshold, we look instead for the presence
of configurations of points. If one of the configurations is
observed, we automatically assign the pattern to category A. For some
recognition systems this matching of configurations is a more
effective method of identifying patterns. Families of solutions to a
linear pseudo-Boolean inequality may be found by a branch-and-exclude
binary tree search algorithm.
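To make the idea of a tree search for solution families concrete, here
is a small Python sketch. It is not the branch-and-exclude algorithm
referred to above, whose details are not given in this section; it is
a plain depth-first search, assumed here for illustration, that
branches on the largest weights first and stops as soon as every
completion of a partial assignment is known to satisfy the inequality.
The weights (1, 2, 0) and right-hand side τ - b_0 = 1 come from the
least squares example of Fig. 2-2B.

```python
from itertools import product

def solution_families(weights, rhs):
    """Return disjoint templates covering all binary z with sum(w*z) >= rhs."""
    n = len(weights)
    # Branch on variables in order of decreasing absolute weight.
    order = sorted(range(n), key=lambda j: -abs(weights[j]))
    families = []

    def search(depth, template, total):
        rest = [weights[j] for j in order[depth:]]
        lo = total + sum(w for w in rest if w < 0)  # smallest reachable sum
        hi = total + sum(w for w in rest if w > 0)  # largest reachable sum
        if lo >= rhs:
            # Every completion is a solution: unassigned positions stay free.
            families.append(tuple(template))
        elif hi >= rhs:
            # Undecided: fix the next variable to 1, then to 0.
            j = order[depth]
            for value in (1, 0):
                branch = list(template)
                branch[j] = value
                search(depth + 1, branch, total + value * weights[j])
        # Otherwise no completion can reach rhs and the branch is dropped.

    search(0, ["-"] * n, 0)
    return families

def brute_force(weights, rhs):
    """Enumerate all solution vectors directly, for checking."""
    return [z for z in product((0, 1), repeat=len(weights))
            if sum(w * zi for w, zi in zip(weights, z)) >= rhs]

families = solution_families((1, 2, 0), 1)
print(families)
print(brute_force((1, 2, 0), 1))
```

Because the two branches at each node fix the same variable to
different values, the families produced this way cannot overlap.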



2.43 An Example of Classification by Template Matching

The example problem considered previously in this section has
an associated pseudo-Boolean inequality

    \hat{y} = b_0 + \sum_{j=1}^{3} b_j z_j = -1 + 1 z_1 + 2 z_2 + 0 z_3,

    \hat{y} \ge \tau  \Rightarrow  \hat{y} \ge 0
    \Rightarrow  \sum_{j=1}^{3} b_j z_j \ge (\tau - b_0)
    \Rightarrow  1 z_1 + 2 z_2 \ge 1.

This pseudo-Boolean inequality has two solution families:

    F_1(z) = (-, 1)
    F_2(z) = (1, 0)

The family F_2(z) has only one solution vector and is said to be
degenerate. The characteristic function of the inequality is

    \phi(z) = (z_2) \cup (z_1 \bar{z}_2).

Applying the F_1(z) = z_2 = x_5 template to each of the 5
patterns in the training set (see Fig. 2-1A) gives a match for
pattern 2. The F_2(z) = z_1 \bar{z}_2 = x_3 \bar{x}_5 template gives a
match for patterns 1 and 4. Thus patterns 1, 2 and 4 satisfy the
characteristic function (φ(z) = 1) and are predicted to belong to
category A. Pattern 4 is still incorrectly classified (see Fig. 2-2C).

2.5 Summary

This chapter has introduced and illustrated the principles
involved in the design of a recognition system of the type to be used
for the document retrieval problem. This is the two-category system
using binary pattern vectors and a non-parametric linear decision
function.

The steps involved in the design are:

(a) representation of patterns as vectors, and choice of
a training set;

(b) feature extraction to reduce the pattern vector dimensions;

(c) specification of a linear decision function and estimation
of the parameters in this linear function. Parameters are estimated
from the training set with a discrete linear approximation model;
and

(d) construction of templates from the decision function, using
the pseudo-Boolean inequality. This gives an alternate (to the
linear decision function) method of categorizing new patterns.



3.0 MODELING THE DOCUMENT RETRIEVAL PROCESS AS
A PATTERN RECOGNITION SYSTEM

This chapter first describes a document retrieval system
(DRS). Next an associated pattern recognition system is defined.
The operations of characterizing the patterns, feature extraction,
and decision function specification are related to the DRS. The
implementation of the decision function to retrieve relevant
documents from a file is presented in detail. Computer methods are
briefly described.


3.1 The Document Retrieval System 


3.11 General

The system to be described here is quite general. In fact
it is identical to the NASA document retrieval system (53,54). This
is a large system which has been in operation since 1962.
Approximately 500,000 documents (technical reports and articles) are
accessible through the system. A master list of about 13,000 index
terms is used to index each document, with an average of about 11
index terms per document. A variety of services are available to
users of this system. Computer searches are performed in both a
batch processing and a time-shared mode using remote terminals.

3.12 Representation of Documents in the File

Each document acquired by the retrieval system is assigned
both a unique identification number and a set of index terms (index
set) which are chosen from a master list. These index terms may in
fact be phrases or word groupings which are deemed to have meaning
to the users of the system.

All acquired documents are placed in a library, while their
identification numbers and index terms form a unit record which is
placed in a computer file.

3.13 Specification of File Search Instructions 


The file is searched to identify documents which have
specified combinations of terms in their index sets. These index term
combinations are specified by the system users as intersections,
unions and negations of index terms. The entire set of matching
instructions is sometimes referred to as a Boolean retrieval strategy
(BRS). A typical BRS is shown below:

    ((heat transfer + thermodynamic properties + thermal properties)
    * (gases + gas flow)) - (fluid flow + fluid properties).

The symbol (+) is used for union (or), (*) is used for
intersection (and), while (-) represents negation (but not).
Parentheses are used where needed to avoid ambiguity. The BRS is
specified subjectively by each user.
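As an illustration of how such a strategy acts on a single document
record, the example BRS above can be sketched with Python set
operations. The document numbers, the index sets and the function name
matches_brs are hypothetical and serve only this illustration.

```python
def matches_brs(index_set):
    """Evaluate the example BRS against one document's index set."""
    heat = {"heat transfer", "thermodynamic properties", "thermal properties"}
    gas = {"gases", "gas flow"}
    fluid = {"fluid flow", "fluid properties"}
    terms = set(index_set)
    # (+) union: at least one term of the group is present.
    # (*) intersection: both groups must be satisfied.
    # (-) negation: none of the excluded terms may be present.
    return bool(terms & heat) and bool(terms & gas) and not (terms & fluid)

# Hypothetical unit records: document number -> index set.
documents = {
    101: {"heat transfer", "gases"},
    102: {"thermal properties", "gas flow", "fluid flow"},
    103: {"heat transfer", "combustion"},
}

retrieved = sorted(d for d, terms in documents.items() if matches_brs(terms))
print(retrieved)
```

Document 102 is rejected by the negation and document 103 by the
intersection, so only document 101 is retrieved.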
3.14 Satisfying User Needs

The computerized search system applies the BRS to the file
and produces a list of document numbers.

Documents on this list match the BRS and may be recovered
from the library. After looking at the actual documents (or
abstracts of them) the user may elect to revise the BRS and search
the file again. This can lead to an iterative type of search.

The user may elect to have an agent (called an information
analyst) compose a BRS for him and screen the cited documents,
rejecting those which do not (in the agent's opinion) match the user's
interests. This practice relieves the user of the need to become
familiar with operational details of the system, or with index
term usage. A disadvantage is that the agent may misinterpret the
user's interests.

Recent trends in the NASA DRS have been to introduce time- 
sharing facilities which permit direct user interaction with the 
file, and eliminate the need for an information analyst. 

3.15 Problem Areas

There are numerous problem areas which can be associated
with DRS's. Some of these are:

(a) poor search effectiveness;

(b) lack of a standard measure of search effectiveness;

(c) 'communication' difficulties between a human user and a
computerized file;

(d) inadequate indexing; and

(e) lack of comprehensive analytical models for the above areas.

The alleviation of problem (c) above is the goal of this
dissertation. A comprehensive analytical model is developed for the
user-file communication process. The communication of the user with
the file here refers to the formulation of search instructions by
the user to specify how the file will be searched. It is assumed
that an indexed file of documents exists, and also that a software
system exists which will implement search instructions.

The present technique of subjectively selecting and combining
index terms to form a BRS is very difficult. This difficulty is due
to the large number of index terms, the extremely large number of
ways to combine these terms and differences in word use between
individuals (indexers and users). Each BRS which is subjectively
formed requires solution of a difficult combinatorial problem.

The subjectively formed BRS now functions as the input to
a file searching system. In the model introduced below, a BRS is
provided as an end product. The user inputs information in the form
of an example set of document numbers, with each document in the
example set assigned a utility. In addition, each document is also
assigned to one of two categories, relevant and non-relevant. This
evaluated example set is all that is required of the user. The BRS
formulation proceeds automatically using this information. None of
the difficult combinatorial problems remain for the user.

The model used to automatically produce a BRS from an example
set of documents is nearly identical with the pattern recognition
system described in chapter 2. Details of model development are given
below.

3.16 A Model of the Retrieval Process 


Consider a file of indexed documents. Assume first that each
document d_k in the file has a utility u_k (or measure of usefulness)
to a given user at a given time. The utility of any given document can
be determined by the user, and assigned a numerical value on some
arbitrary scale (say 1 to 10). These are reasonable assumptions
repeatedly used in operations research studies. See for example
Fishburn (57) or Hadley.

Next assume that, dependent on the scale which is used to measure
document utility, a threshold τ can be specified by the user which
divides all documents in the file into two classes. Those documents
with u_k ≥ τ are defined as being relevant. Those with u_k < τ are not
relevant. The goal of the retrieval system is to retrieve all relevant
documents and not retrieve any others.

index terms in the master list (about 13,000 for the NASA system). Each
element is 1 if index term j is used to index document k and 0
otherwise. On the average, only about 11 of the elements will be
nonzero.

3.22 Definition of Two Categories

Each document d_k is either relevant or nonrelevant depending
on whether its utility u_k ≥ τ or u_k < τ. These constitute the two
mutually exclusive categories to which each document belongs. The
function of the system will be to recognize relevant documents, or to
assign documents to category A or B based on properties of the
associated pattern vector.

Each user defines his own categories (relevant or not) depending
on his personal utility for documents in the file. A training set
is formed which represents a sampling of the personal utility function
of an individual user. Thus, each user has an individual pattern
recognition system at his disposal.


3.23 The Configuration of the System 


The pattern recognition system designed to recognize relevant
documents has the general configuration discussed below. (See also
Fig. 1-1.)

3.231 Training Set Formation. The training set is composed of
documents which have been selected by the user as being typically
relevant or non-relevant. An estimate of the utility u of each
document in the training set is provided by the user. Documents in the
training set have been located via a manual search by the user, from a
previous search, or from references provided by others. If the search
is done iteratively, the training set grows and only the initial
training set need be selected manually.

3.232 Feature Extraction. All index terms in the training set are
ranked using an information theoretic measure of goodness. This
measure is the number of bits of information which each index term
individually provides about the category of documents in the training
set. Details are given in chapter 4. All index terms except a
specified number with the highest information measure are discarded.
The retained index terms are the 'extracted features'.
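The ranking step can be sketched in Python under a stated assumption:
the exact estimate used is deferred to chapter 4, so the sketch below
substitutes the ordinary mutual information (in bits) between the
presence of a term and the document category. The function names, the
term columns and the relevance labels are all hypothetical.

```python
import math

def information_bits(term_present, is_relevant):
    """Mutual information (bits) between a 0/1 term and a 0/1 category."""
    n = len(term_present)
    bits = 0.0
    for t in (0, 1):
        for c in (0, 1):
            joint = sum(1 for ti, ci in zip(term_present, is_relevant)
                        if ti == t and ci == c) / n
            p_t = sum(1 for ti in term_present if ti == t) / n
            p_c = sum(1 for ci in is_relevant if ci == c) / n
            if joint > 0:
                bits += joint * math.log2(joint / (p_t * p_c))
    return bits

def extract_features(columns, is_relevant, keep):
    """Keep the `keep` terms with the highest information measure."""
    ranked = sorted(columns,
                    key=lambda name: information_bits(columns[name], is_relevant),
                    reverse=True)
    return ranked[:keep]

# Hypothetical training set of four documents (1 = relevant).
is_relevant = [1, 1, 0, 0]
columns = {
    "gases":         [1, 1, 0, 0],  # present exactly in the relevant documents
    "turbines":      [1, 0, 1, 0],  # unrelated to relevance
    "heat transfer": [1, 1, 1, 0],  # partly related
}

features = extract_features(columns, is_relevant, keep=2)
print(features)
```

A term that is present in exactly the relevant documents carries one
full bit, while a term distributed independently of relevance carries
none and is discarded first.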

3.233 Decision Function Formation. The pattern recognition system
of this chapter attempts to classify documents as relevant or not
based on their predicted utilities. The categories are not absolutes,
but are defined with reference to an arbitrary utility scale.

The system of chapter 2 was slightly different in character.
Categories A and B there were absolute. Parameters in the decision
function of chapter 2 were estimated by solving an approximation
problem where the observed dependent variables y_i were dichotomous
and could be regarded as the difference between two probabilities. The
goal of the approximation problem was to 'best' approximate
y_i = p(A/z_i) - p(B/z_i). The threshold was τ = 0.

The decision function of the present chapter is also set up as
an approximation problem, but the objective is to approximate the user
assigned utilities of documents in the training set. The observed
dependent variables y_i are no longer dichotomous and the threshold is
now set by the user instead of being fixed at zero.

Another way of describing the differences in the decision
functions is to consider the approximation model y = Xβ + r. In
chapter 2, the observed variables y_i are regarded as being fixed and
non-random, while the matrix X is considered as a random variable. In
this case variations in the residual vector r are caused entirely by
variations in X.

In the system of this chapter, the observed y_i are regarded
as random variables and the matrix X is fixed. Here the y_i are
utility estimates which are corrupted by 'noise'. Variation in the
residual vector r is caused entirely by variation in the observed
variables y_i.

It can be seen that regardless of whether the matrix X or the
vector y is taken to be the source of variability, the model remains
the same. In either case a reasonable estimate of β is one which
minimizes the length of the residual vector r. When the vector y is
regarded as fixed, the decision function is often referred to as a
discriminant function, and when the matrix X is fixed the decision
function can be called an interpolation or regression function. The
relation between approximation theory models and the pattern
recognition process has been discussed by P.A.V. Hall (60).

The pattern recognition model used for document retrieval
purposes here employs a linear decision function which is actually a
regression function for predicting document utility as a function of
'extracted' index terms. The training set document utility estimates
are regarded as noisy measurements. To emphasize this, the decision
function of this model will be referred to hereafter as an LUPF (linear
utility prediction function).

For reasons of convenience, the test configuration uses an
integer utility scale where y_i ∈ {1, 2, ..., 9} and τ is specified by
the user. When y_i = 1, the document has no utility to the user and
when y_i = 9, the document is most useful. The example problem
presented later in this chapter uses a binary utility scale where
y_i ∈ {+1, -1}. When y_i = +1 the document is relevant and when
y_i = -1 the document is non-relevant. In this case the threshold
τ = 0. Note that when this binary utility scale is used, the LUPF here
becomes identical to the decision function of chapter 2.


3.3 Implementing the Decision Function

3.31 Direct Method

Recall from section 2.4 that when a pattern vector x of
unknown classification is to be assigned to either category A or B,
there are two equivalent methods of making the decision by using the
index terms in the decision function (the extracted features) which are
common to the pattern vector.

The direct method simply adds up the 'weights' of features in
the vector z and compares the sum to the threshold, after which the
vector x is put into the indicated category, i.e.,

    \sum_{j=1}^{n} b_j z_j \ge (\tau - b_0)

implies that the pattern vector x is assigned to category A.

For the document recognition system, the index term weights are
summed and compared to the utility threshold τ, after which the
document vector x is classified.

3.32 Indirect Method

The indirect method derives matching templates by thresholding
the decision function to form a linear pseudo-Boolean inequality (LPBI).
This inequality is solved for its families of solutions. Details are
presented in chapter 6. Each solution family becomes a matching
template. If one of these templates matches the vector x, then x is
assigned to category A. Otherwise, x belongs to category B.

For the document recognition system, the matching templates
correspond to combinations of index terms. Observe that the matching
templates are equivalent in form and function to the user's
subjectively specified BRS.

Thus, by considering document retrieval as a pattern
recognition process, we analytically derive a BRS as a union of
matching templates. This is an important result which allows the
previously subjective BRS formation to be modeled as a feature
extraction and decision operation.

To further illustrate this connection, consider the example BRS
introduced in section 3.13. Figure 3-1 shows how this subjective BRS
can be written as a union of solution families to some (unknown)
pseudo-Boolean inequality (not necessarily linear, of course).
Fig. 3-1A shows the original BRS. Fig. 3-1B shows the reduction of the
BRS to a union of solution families. Fig. 3-1C shows the solution
families in tabular form.

The solution families which result from reducing a subjectively
determined BRS to the form of Fig. 3-1C are not necessarily mutually
exclusive. For example, any document containing the combination of
index terms given by

    T = (T_1, T_2, T_3, T_4, T_5, T_6, T_7) = (1, 0, 0, 1, 1, 0, 0)

is covered by both solution families F_1(T) and F_2(T) shown in
Fig. 3-1C. The solution families of an analytically determined BRS are
mutually exclusive. This is important because no search effort is
wasted by retrieving the same document with two different solution
families.

FIGURE 3-1

EQUIVALENCE OF A SUBJECTIVE BRS TO A UNION OF SOLUTION FAMILIES

A. Subjective BRS

    ((T_1 + T_2 + T_3) * (T_4 + T_5)) - (T_6 + T_7)

    WHERE: T_1 = heat transfer
           T_2 = thermodynamic properties
           T_3 = thermal properties
           T_4 = gases
           T_5 = gas flow
           T_6 = fluid flow
           T_7 = fluid properties

B. Reduction of the Subjective BRS to a Union of Solution Families

    ((T_1 + T_2 + T_3) * (T_4 + T_5)) - (T_6 + T_7)

    = ((T_1*T_4) + (T_1*T_5) + (T_2*T_4) + (T_2*T_5) + (T_3*T_4) + (T_3*T_5)) - (T_6 + T_7)

    = (T_1*T_4*\bar{T}_6*\bar{T}_7) + ... + (T_3*T_5*\bar{T}_6*\bar{T}_7)

    = F_1(T) \cup F_2(T) \cup ... \cup F_6(T)

C. Solution Families in Tabular Form

    [table not recoverable from the source]

3.33 Relation of Decision Function Implementation to Retrieval System
File Structure

There are two basic methods of organizing computer files
composed of index term - document number records. The first method is
to have the document numbers arranged in a sequential master list in
memory. Associated with each document number in this master list is a
sublist containing the index terms which belong to the document. This
type of organization results in a sequentially structured file (SSF).
(Sometimes this type of file is called a linear file.)

To implement the decision function on an SSF, the master list
of document numbers is examined sequentially. The sublist of index
terms associated with each document number is scanned to determine if
any of the 'feature terms' are present. If so, their weights are
summed and the result compared to the threshold. All relevant
documents in the file can be identified by repeating this operation for
each document number in the master list. It is also possible to see if
index term combinations in each document sublist match those specified
by each template in the BRS. Thus for an SSF the relevant documents
can be recognized by summing the term weights directly, or by using the
template matching technique with a BRS.
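
The sequential scan described above can be sketched as follows. This is
only an illustration: the file contents, the term weights, the threshold
and the name `scan_ssf` are hypothetical and are not taken from the
thesis.

```python
# Hypothetical sequentially structured file (SSF):
# master list of document numbers, each with its sublist of index terms.
ssf = {
    101: {"gases", "heat transfer"},
    102: {"fluid flow"},
    103: {"gas flow", "fluid flow"},
}

# Hypothetical feature-term weights and relevance threshold.
weights = {"heat transfer": 1.0, "gases": 1.0, "fluid flow": -1.0}
threshold = 0.5


def scan_ssf(ssf, weights, threshold):
    """Inspect every record in turn and sum the weights of the feature terms present."""
    relevant = []
    for doc_number, terms in ssf.items():
        # Terms that are not features contribute nothing to the utility.
        utility = sum(weights.get(term, 0.0) for term in terms)
        if utility >= threshold:
            relevant.append(doc_number)
    return relevant


relevant_docs = scan_ssf(ssf, weights, threshold)
```

Every record is touched once, which is the cost behaviour attributed to
the SSF in the text.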

The major disadvantage of an SSF is that all records in the
file must be individually inspected to identify a very small subset of
relevant documents. The cost of searching an SSF increases
proportionally with the number of document records it contains.

To reduce the unit cost of identifying relevant documents in a
file, the file can be organized in a different manner. Here the master
list is composed of the individual index terms in some order. Each
index term in the master list has an associated sublist of document
numbers. Each document numbered in the sublist is indexed with the term
in the master list. This type of file can be called an inversely
structured file (ISF).



To implement the decision function on an ISF the matching
templates of the BRS are necessary. Index term weights cannot be applied.
The individual BRS templates are matched by set intersection operations
on all index terms corresponding to fixed indices in the solution
families. The set operations are performed only on the sets of document
numbers which are associated with index terms which are 'features'.
These feature sets are a small fraction of the total file. Thus the
unit costs of recognizing patterns (relevant documents) are lower in an
ISF than in an SSF. However, the increased search efficiency is offset
in part by the extra costs incurred by organizing the ISF. (The
natural ordering is the SSF.)
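
Template matching on an ISF can be sketched with ordinary set
operations. The posting lists and the function name `match_template`
below are hypothetical; the template shown has the form
$T_1 * T_4 * \bar{T}_6 * \bar{T}_7$, and the sketch assumes every template
fixes at least one term at 1.

```python
# Hypothetical inversely structured file (ISF):
# master list of index terms, each with its sublist of document numbers.
isf = {
    "T1": {1, 2, 5},
    "T4": {2, 3, 5},
    "T6": {5},
    "T7": {3},
}


def match_template(isf, required, excluded):
    """Documents indexed with every term in `required` and with none in `excluded`.

    Assumes `required` is non-empty; a template that only excludes terms
    would need the full set of document numbers as a starting point.
    """
    docs = set.intersection(*(isf.get(term, set()) for term in required))
    for term in excluded:
        docs -= isf.get(term, set())
    return docs


hits = match_template(isf, required=["T1", "T4"], excluded=["T6", "T7"])
```

Only the sublists of the feature terms named in the template are read,
which is why the unit cost is lower than for a full sequential scan.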

3.34 Example Showing System Operation

Figures 3-2 and 3-3 illustrate how the decision function is
derived and how the documents predicted to be relevant are identified
using both a direct weighted term approach and the BRS templates.

Figure 3-2A shows the matrix model which might arise from the
selection of five index terms as features. The training set contains
eight documents, with $y_i = +1$ for relevant documents and $y_i = -1$
for nonrelevant documents. The relevance threshold $\tau$ for this model
is taken to be zero. The best approximate solution (in the $L_1$ sense)
is shown in Fig. 3-2B. This also shows the residual vector r.



Figure 3-2C shows the decision function, or linear utility
prediction equation (LUPF). When this function is thresholded (using
$\tau = 0$) a linear pseudo-Boolean inequality (LPBI) results which has six
solution families as shown.

Figure 3-3 shows all 32 possible combinations of the five index
terms which were extracted as features. The predicted utility of each
combination is shown as it would be determined by a direct summing of
the index term weights. This approach might be taken with an SSF.

The groups of combinations with $u \ge 0$ which are specified by
the solution families (templates) of the BRS are identified for
comparison. This approach to identifying relevant documents would be
taken with an ISF.
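
The direct-summing approach of Figure 3-3 can be reproduced with a few
lines of code. The sketch below assumes the LUPF of Fig. 3-2C reads
$u = T_1 - T_2 + T_4 - T_5$ (so that $T_3$ carries zero weight); that
reading of the figure, and the variable names, are assumptions.

```python
from itertools import product

# Assumed LUPF weights for (T1, T2, T3, T4, T5): u = T1 - T2 + T4 - T5.
weights = (1, -1, 0, 1, -1)
tau = 0  # relevance threshold


def utility(config):
    """Predicted utility: direct sum of the weights of the terms present."""
    return sum(w * t for w, t in zip(weights, config))


# All 32 configurations, listed from (1,1,1,1,1) down to (0,0,0,0,0).
table = [(config, utility(config)) for config in product((1, 0), repeat=5)]

# Configurations predicted relevant, i.e. those satisfying the LPBI u >= tau.
predicted_relevant = [config for config, u in table if u >= tau]
```

Under this assumed LUPF, 22 of the 32 configurations satisfy the
inequality; these are the configurations the solution families must
cover between them.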



FIGURE 3-2

SAMPLE PROBLEM ILLUSTRATING DERIVATION OF THE DECISION FUNCTION AND BRS

A. Matrix Model Arising from Training Set of Documents

B. Best Approximate $L_1$ Solution b and Residual Vector r

C. LUPF, LPBI and BRS

LUPF: $u = T_1 - T_2 + T_4 - T_5$

LPBI: $u \ge \tau = 0 \Rightarrow T_1 - T_2 + T_4 - T_5 \ge 0$
FIGURE 3-3

PREDICTED UTILITIES FOR COMBINATIONS OF INDEX TERMS

Combination    Index term configuration    Predicted
number         T1   T2   T3   T4   T5      utility (u)

     1          1    1    1    1    1         0 *
     2          1    1    1    1    0         1 *
     3          1    1    1    0    1        -1
     4          1    1    1    0    0         0 *
     5          1    1    0    1    1         0 *
     6          1    1    0    1    0         1 *
     7          1    1    0    0    1        -1
     8          1    1    0    0    0         0 *
     9          1    0    1    1    1         1 *
    10          1    0    1    1    0         2 *
    11          1    0    1    0    1         0 *
    12          1    0    1    0    0         1 *
    13          1    0    0    1    1         1 *
    14          1    0    0    1    0         2 *
    15          1    0    0    0    1         0 *
    16          1    0    0    0    0         1 *
    17          0    1    1    1    1        -1
    18          0    1    1    1    0         0 *
    19          0    1    1    0    1        -2
    20          0    1    1    0    0        -1
    21          0    1    0    1    1        -1
    22          0    1    0    1    0         0 *
    23          0    1    0    0    1        -2
    24          0    1    0    0    0        -1
    25          0    0    1    1    1         0 *
    26          0    0    1    1    0         1 *
    27          0    0    1    0    1        -1
    28          0    0    1    0    0         0 *
    29          0    0    0    1    1         0 *
    30          0    0    0    1    0         1 *
    31          0    0    0    0    1        -1
    32          0    0    0    0    0         0 *

* Combination with $u \ge 0$, covered by a solution family of the BRS.


4.0 AN INFORMATION THEORETIC MEASURE FOR RANKING
AND SELECTING INDEX TERMS

4.1 Introduction

An information theoretic measure of goodness is developed for
ranking index terms found in a training set of documents. Each index
term is regarded independently as a potential 'experiment' which can
be used to predict the relevance of documents in the training set.

For example, knowing that there are 20 relevant and 30
nonrelevant documents in a training set, but lacking any other
information, a decision maker, if presented with a document selected at
random from the training set, would assume that the probability of the
document being relevant (before he examines it) is 0.40. Suppose now
that before inspecting the document and making his decision about
relevance, the user is shown one index term associated with the document.
If he knows that this term occurred with 20 of the training set
documents and that 15 of these 20 were relevant, then the user would be
justified in concluding that the probability of the document being
relevant is 0.75.

Knowledge that the particular index term was present has
provided information (or resolved uncertainty) about the classification
of the document. In fact it will provide (on the average and for this
example using the above data) 0.16 bits of information each time it is
found with a document. The development of this quantitative measure
of information (divorced from economic considerations) will be
presented here. This measure is used to select the best index terms,
i.e., those terms which individually provide the most information
about document relevance.
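
The figure quoted for this example can be checked numerically. The
sketch below assumes that the information provided when the term is
found is the drop in entropy from the prior (0.40, 0.60) to the
posterior (0.75, 0.25); that interpretation, and the helper name
`entropy`, are assumptions made for illustration.

```python
from math import log2


def entropy(probs):
    """Entropy in bits; terms with p = 0 contribute nothing (0 log 0 = 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)


prior = (20 / 50, 30 / 50)       # (relevant, nonrelevant) before seeing the term
posterior = (15 / 20, 5 / 20)    # (relevant, nonrelevant) given the term is present

# Uncertainty resolved when the term is found with a document.
gain = entropy(prior) - entropy(posterior)
```

With these numbers the entropy falls from about 0.971 bits to about
0.811 bits, a difference of roughly 0.16 bits.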

4.2 The Decision Theory Model

A simple decision theory model is shown below (see Hadley
or Fishburn for a more thorough discussion).

$$
\begin{array}{c|cccc}
       & p(x_1) & p(x_2) & \cdots & p(x_n) \\
       & x_1    & x_2    & \cdots & x_n    \\ \hline
a_1    & u_{11} & u_{12} & \cdots & u_{1n} \\
a_2    & u_{21} & u_{22} & \cdots & u_{2n} \\
\vdots & \vdots & \vdots &        & \vdots \\
a_r    & u_{r1} & u_{r2} & \cdots & u_{rn}
\end{array}
$$


There are n 'states of nature' or possible outcomes $x_j$,
$j = 1, 2, \ldots, n$, which are relevant to the decision maker's problem. The
probability distribution $p(X) = \{p(x_1), \ldots, p(x_n)\}$ over these states
of nature is assumed known to the decision maker. A random experiment
is performed which determines which state of nature $x_j$ actually holds.
The results of this experiment are not available to the decision maker.

The decision maker has a set of r possible actions $a_i$,
$i = 1, 2, \ldots, r$, which he can take. One and only one of the actions
must be selected.

After the action has been selected by the decision maker, the
true state of nature $x_j$ is revealed to him. He will then receive the
reward $u_{ij}$ which may be negative. ($u_{ij}$ is a utility, which includes
monetary as well as more subjective rewards.)

The decision problem is solved when the decision maker chooses
an action. The best action $a^*$ is one which maximizes the expected
utility, i.e.

$$\max_{i} \Big\{ \sum_{j=1}^{n} u_{ij}\, p(x_j) \Big\}$$
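
As an illustration, the choice of the best action can be sketched as
below. The utility table and prior are hypothetical (a two-action,
two-state problem of 'retrieve' versus 'do not retrieve'), and the
function names are introduced only for this sketch.

```python
def expected_utilities(u, p):
    """Expected utility of each action: sum_j u[i][j] * p[j]."""
    return [sum(u_ij * p_j for u_ij, p_j in zip(row, p)) for row in u]


def best_action(u, p):
    """Index of the action with maximum expected utility, and that utility."""
    eu = expected_utilities(u, p)
    i = max(range(len(eu)), key=eu.__getitem__)
    return i, eu[i]


# Hypothetical utilities: rows are actions, columns are states of nature
# (x_1 = relevant, x_2 = nonrelevant).
u = [
    [10.0, -5.0],  # a_1: retrieve the document
    [0.0, 0.0],    # a_2: do not retrieve
]
p = [0.4, 0.6]     # prior p(x_1), p(x_2)

action, value = best_action(u, p)
```

Here retrieving has expected utility 10(0.4) - 5(0.6) = 1, so it is
preferred to doing nothing.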


4.21 Decision Problems with Experimentation

A natural extension of the decision model discussed
above is to allow the decision maker to perform an auxiliary experiment
before picking an action. Recall that the state of nature $x_j$ has
already been determined, but the results are unknown to him. This
experiment can be considered to be an attempt to gain more information
about the true state of nature.

Define $Y = \{y_1, y_2, \ldots, y_s\}$ as the event set for the experiment
performed by the decision maker, i.e., these are the only outcomes.

It is assumed that the conditional distributions

$$p(Y/x_j) = \{p(y_1/x_j), \ldots, p(y_s/x_j)\}, \quad j = 1, 2, \ldots, n$$

are known to the decision maker, as well as $p(X) = \{p(x_1), \ldots, p(x_n)\}$.
4.211 Bayes Rule. It is a trivial consequence of the definition of
conditional probabilities

$$p(y_k/x_j) = \frac{p(x_j, y_k)}{p(x_j)}$$

that we are able to write

$$p(x_j/y_k) = \frac{p(x_j, y_k)}{p(y_k)}.$$

Thus

$$p(x_j, y_k) = p(y_k/x_j)\, p(x_j) = p(x_j/y_k)\, p(y_k).$$

Now using

$$p(y_k) = \sum_{j} p(x_j, y_k) = \sum_{j} p(y_k/x_j)\, p(x_j)$$

we have

$$p(x_j/y_k) = \frac{p(x_j)\, p(y_k/x_j)}{\sum_{j} p(x_j)\, p(y_k/x_j)}.$$

This last expression is known as Bayes Rule.
$p(X/y_k) = \{p(x_1/y_k), \ldots, p(x_n/y_k)\}$, $k = 1, \ldots, s$, is a new
probability distribution over the n states of nature.

The interpretation here is that for any particular observed
experimental outcome $y_k$, an entire new probability distribution $p(X/y_k)$
may be constructed. Since the experiment has s possible outcomes,
there are s possible new distributions which may be derived.

To distinguish between the initial distribution $p(X)$ and the
distributions $p(X/y_k)$ derivable after the experimental outcome $y_k$
has been observed, it has become customary to call $p(X)$ the prior
distribution and $p(X/y_k)$ the posterior distribution.
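
The passage from prior to posterior can be sketched in a few lines. The
numbers below reuse the introductory example of Section 4.1 (20 relevant
and 30 nonrelevant documents, a term found with 15 relevant and 5
nonrelevant documents); the function name and the array layout
`likelihood[j][k]` for $p(y_k/x_j)$ are choices made only for this sketch.

```python
def posterior(prior, likelihood, k):
    """Posterior p(x_j / y_k) from the prior p(x_j) and the conditionals p(y_k / x_j).

    likelihood[j][k] holds p(y_k / x_j).
    """
    joint = [prior[j] * likelihood[j][k] for j in range(len(prior))]
    p_yk = sum(joint)                      # marginal probability of outcome y_k
    return [value / p_yk for value in joint]


prior = [0.4, 0.6]                         # p(relevant), p(nonrelevant)
likelihood = [
    [15 / 20, 5 / 20],                     # p(term present / relevant), p(term absent / relevant)
    [5 / 30, 25 / 30],                     # p(term present / nonrelevant), p(term absent / nonrelevant)
]

post_present = posterior(prior, likelihood, k=0)
post_absent = posterior(prior, likelihood, k=1)
```

With the term present, the posterior probability of relevance is 0.75,
in agreement with the introductory example.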

To perform the transformation from prior to posterior
distributions, it is necessary to know both the prior distribution $p(X)$ and
the conditional distributions $p(Y/x_j)$, $j = 1, 2, \ldots, n$. This knowledge is
equivalent to knowing the joint distribution

$$p(y_k, x_j) = p(y_k/x_j)\, p(x_j), \quad j = 1, 2, \ldots, n, \quad k = 1, 2, \ldots, s.$$

After the posterior distribution $p(X/y_k)$ is determined, it is
used in place of the prior distribution to determine the action $a(k)$
having the maximum expected utility, i.e.

$$u_{a(k)} = \max_{i \le r} \Big\{ \sum_{j=1}^{n} u_{ij}\, p(x_j/y_k) \Big\}.$$

The experiment has allowed a better, more up-to-date estimation of the
state of nature.

4.3 Selection of Experiments

The purpose of the experiment performed by the decision maker
is to provide more information about the true state of nature. The
information is conveyed by permitting a revision of the probability
distribution over the state of nature from $p(X)$ to $p(X/y_k)$.

In many problems, the decision maker can choose from a group of
experiments only one which will be performed to obtain $p(X/y_k)$. This
raises the interesting question of which experiment is 'best'. That
is, how can experiment 'goodness' be defined, to permit a ranking of all
available experiments?

4.31 Decision Theory Approach when the Utilities are Known

In the context of the decision model discussed above, when the
utilities $u_{ij}$ are known, the answer is to pick the experiment which
maximizes the expected utility averaged over all possible posterior
distributions.

For each experiment, consider each outcome $y_k$ in turn and,
using the associated posterior distribution $p(X/y_k)$, determine the
maximum utility which will result from making the best decision using
this distribution. Then weight these utilities by the marginal
probabilities $p(y_k)$ that the outcomes will occur. This gives the
expected utility for each experiment assuming the best decision is always
made for each possible outcome. Finally, the 'best' experiment is the
one with the highest average utility (averaged over all possible
posterior distributions).
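
The ranking procedure of this section can be sketched as follows. The
utilities and the two candidate experiments are hypothetical, and
`experiment_value` is a name introduced only for this illustration; it
weights the best posterior expected utility of each outcome by the
marginal probability of that outcome.

```python
def experiment_value(u, prior, likelihood):
    """Expected utility of an experiment when the best action follows each outcome.

    u[i][j] is the utility of action a_i in state x_j;
    likelihood[j][k] is p(y_k / x_j).
    """
    n, s = len(prior), len(likelihood[0])
    total = 0.0
    for k in range(s):
        joint = [prior[j] * likelihood[j][k] for j in range(n)]
        p_yk = sum(joint)
        if p_yk == 0:
            continue                       # outcome never occurs
        post = [value / p_yk for value in joint]
        best = max(sum(row[j] * post[j] for j in range(n)) for row in u)
        total += p_yk * best               # weight by marginal probability of y_k
    return total


u = [[10.0, -5.0], [0.0, 0.0]]             # hypothetical utilities
prior = [0.4, 0.6]

informative = [[0.75, 0.25], [1 / 6, 5 / 6]]   # outcome depends on the state
uninformative = [[0.5, 0.5], [0.5, 0.5]]       # outcome independent of the state

value_informative = experiment_value(u, prior, informative)
value_uninformative = experiment_value(u, prior, uninformative)
```

The uninformative experiment leaves the expected utility at its prior
value of 1.0, while the informative one raises it to 2.5, so the latter
would be ranked first.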

4.32 Inadequacy of the Decision Theory Model when the Utilities are
Not Known

There are at least three situations which frequently arise and
make the above procedures inapplicable.

(a) The utilities are all equal. In this case the expected
costs of all actions are equal and a best action cannot be chosen.

(b) The utilities are unknown, or fluctuate to such an extent
that they can be considered to be unknown.

(c) The utilities do not exist, but a prior distribution can
be postulated, and various observed variables can give rise to
posterior distributions.

Situation (b) above might occur, for example, where a local
decision problem exists within a large system. The global utility of
selecting various local experiments is not estimable in this case.
Such types of situations are felt to arise frequently in design
problems, where small portions of the overall system are designed
independently of the others.

Situation (c) arises most often from a purely analytical
situation where no utilities are associated with a choice of experiment.

All three of the above situations negate the selection of
information-gathering experiments by using an expected utility measure.
However, the fact that experiments do provide information remains,
whether or not an economic value can be attached to the information.

The process of index term selection can be modeled in the
context of a prior distribution which is modified by experimental
information to give posterior distributions. However, utilities are not
easily defined.

For the evaluation of these processes without attaching an
economic measure, we turn now to information theory.


4.4 Results from Information Theory*

4.41 Definition of Entropy

As a definition, let

$$H(P) = H(p_1, \ldots, p_n) = -C \sum_{i=1}^{n} p_i \ln p_i$$

be called the entropy of the probability distribution

$$P = \{p_1, p_2, \ldots, p_n\}, \quad \text{where } \sum_{i=1}^{n} p_i = 1, \quad p_i \ge 0.$$

The functional form of H(P) is determined up to a multiplicative
constant by specifying the three conditions given below.

* Analytical developments presented here closely follow those
presented by A. Feinstein. As a secondary source, see S.
Watanabe(65).

(A) $H(p, 1 - p)$ is a continuous function of p for $0 \le p \le 1$.

(B) H(P) is a symmetric function of all its variables.

(C) If $p_n = q_1 + q_2 > 0$, then

$$H(p_1, \ldots, p_{n-1}, q_1, q_2) = H(p_1, p_2, \ldots, p_n) + p_n H\Big(\frac{q_1}{p_n}, \frac{q_2}{p_n}\Big).$$


By agreeing to take logarithms to the base 2 and by setting
C = 1, the units of information become bits. We shall denote this by
writing

$$H(P) = -\sum_{i=1}^{n} p_i \log p_i$$

with the understanding that $0 \log 0 = 0$.

It is possible to prove the two important results
given below.

(A) The entropy H(P) is bounded. That is, $0 \le H(P) \le \log n$
with $H(P) = 0$ iff $p_k = 1$ for some k, and $H(P) = \log n$ iff $p_i = 1/n$
for all i.

(B) H(P) is strictly concave*.

Result (A) has an intuitive interpretation when the entropy is
regarded as the uncertainty in the probability distribution P.

* This follows from the fact that $K = -p \log p$ is strictly
concave: $d^2K/dp^2 < 0$ for $p > 0$.

Let $p_k = p(x_k) = 1$; $p_j = 0$, $j \ne k$. In this case, event $x_k$
is a certainty, and the entropy is zero. Let $p_j = 1/n$; $j = 1, \ldots, n$.
In this case all events $x_j$ are equally uncertain and the entropy is
a maximum.

By result (B), the function H(P) smoothly approaches its single
maximum value. Intuitively, this allows us to rank all probability
distributions without ambiguity according to their entropy, in the
sense that distributions with greater entropy are always closer to the
maximum entropy distribution given by $p_i = 1/n$.

Figure 4-1 shows the entropy for the two state distribution
$p_1 + p_2 = 1$; $p_1, p_2 \ge 0$. The maximum entropy of one bit is attained
when $p_1 = p_2 = 1/2$. The maximum is fairly broad.
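
The bounds in result (A) and the breadth of the two-state maximum can
be checked with a short sketch. The helper name `entropy` and the sample
distributions are chosen only for illustration.

```python
from math import log2


def entropy(probs):
    """H(P) in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)


certain = entropy([1.0, 0.0])          # one event is a certainty
fair = entropy([0.5, 0.5])             # two-state maximum
uniform4 = entropy([0.25] * 4)         # maximum for n = 4 is log2(4)
near_fair = entropy([0.4, 0.6])        # still close to one bit
```

Moving from (0.5, 0.5) to (0.4, 0.6) lowers the entropy only from 1 bit
to about 0.971 bits, which illustrates how broad the maximum is.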

4.42 Definitions of Event Sets and Probabilities

Let $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ be two finite
discrete sets of events. Denote by $X \otimes Y$ the product set consisting
of all mn pairs $(x_i, y_j)$.

Assume that there is a probability distribution defined over
$X \otimes Y$ with probabilities denoted by $p(x_i, y_j)$. This is the joint
distribution of X and Y, p(X,Y), where

$$p(x_i, y_j) \ge 0, \quad i = 1, 2, \ldots, n; \quad j = 1, 2, \ldots, m$$

$$\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) = 1.$$
FIGURE 4-1

ENTROPY PLOT OF A SIMPLE BINARY DISTRIBUTION
AS A FUNCTION OF ONE PROBABILITY

(Vertical axis: $H(p_1, p_2)$ in bits.)

where $H(p_1, p_2) = -[p_1 \log_2 p_1 + p_2 \log_2 p_2]$,
$p_1 + p_2 = 1$; $p_1, p_2 \ge 0$


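Let the marginal probabilities be given by

$$p(x_i) = \sum_{j=1}^{m} p(x_i, y_j), \quad i = 1, 2, \ldots, n$$

and

$$p(y_j) = \sum_{i=1}^{n} p(x_i, y_j), \quad j = 1, 2, \ldots, m.$$

Then denote the marginal distributions by p(X) and p(Y).
Define conditional probabilities as

$$p(x_i/y_j) = \frac{p(x_i, y_j)}{p(y_j)}$$

and

$$p(y_j/x_i) = \frac{p(x_i, y_j)}{p(x_i)}.$$

Then let the conditional distributions be given by

$$p(X/y_j), \quad j = 1, 2, \ldots, m$$

and

$$p(Y/x_i), \quad i = 1, 2, \ldots, n.$$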
4.43 Entropy of the Distributions

It is useful to define the entropies of the joint distribution,
the marginal distributions and the conditional distributions as
shown below.

(A) The entropy of the joint distribution is

$$H(X,Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i, y_j).$$

(B) The entropies of the marginal distributions are given by

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$$

and

$$H(Y) = -\sum_{j=1}^{m} p(y_j) \log p(y_j).$$

(C) Define the entropy of each conditional distribution as

$$H(X/y_j) = -\sum_{i=1}^{n} p(x_i/y_j) \log p(x_i/y_j), \quad j = 1, 2, \ldots, m.$$

Then the average entropy of all conditional distributions is
defined by

$$
\begin{aligned}
H(X/Y) &= \sum_{j=1}^{m} p(y_j)\, H(X/y_j) \\
       &= -\sum_{j=1}^{m} p(y_j) \sum_{i=1}^{n} p(x_i/y_j) \log p(x_i/y_j) \\
       &= -\sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) \log p(x_i/y_j).
\end{aligned}
$$


4.44 Useful Relationships between Entropies of Distributions

The relations shown below for distributional entropies can be
proven by using the previous definitions:

$$H(X,Y) = H(X) + H(Y/X) = H(Y) + H(X/Y) \qquad \text{(4-1)}$$

$$H(X,Y) \le H(X) + H(Y) \qquad \text{(4-2)}$$

with equality iff X and Y are statistically independent.

$$0 \le H(X/Y) \le H(X) \qquad \text{(4-3)}$$

$$R = H(X) - H(X/Y) = H(Y) - H(Y/X) \ge 0 \qquad \text{(4-4)}$$

$$R = H(X) + H(Y) - H(X,Y) \qquad \text{(4-5)}$$
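
Relations (4-1) through (4-5) can be checked numerically on any joint
distribution. The 2 x 2 joint distribution below is hypothetical, and
the variable names are chosen only for this sketch.

```python
from math import log2


def H(probs):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)


# Hypothetical joint distribution p(x_i, y_j); rows index x, columns index y.
joint = [[0.30, 0.10],
         [0.10, 0.50]]

px = [sum(row) for row in joint]             # marginal p(X)
py = [sum(col) for col in zip(*joint)]       # marginal p(Y)
h_xy = H([p for row in joint for p in row])  # joint entropy H(X,Y)

# Average conditional entropies H(X/Y) and H(Y/X).
h_x_given_y = sum(py[j] * H([joint[i][j] / py[j] for i in range(len(px))])
                  for j in range(len(py)))
h_y_given_x = sum(px[i] * H([joint[i][j] / px[i] for j in range(len(py))])
                  for i in range(len(px)))

r_from_44 = H(px) - h_x_given_y              # relation (4-4)
r_from_45 = H(px) + H(py) - h_xy             # relation (4-5)
```

Both expressions for R agree to rounding error, as (4-1) requires.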
4.5 Interpretation of Information Theoretic Results

4.51 Bayesian Interpretation

The above results are all we need to describe information in
quantitative, non-economic terms.

Intuitively, the entropy of a distribution represents the
uncertainty in the distribution. If we revise the distribution from
prior to posterior through Bayes rule after observing the results of
an experiment, how does the entropy change?

By letting H(X) be identified with the uncertainty in the
prior distribution, it follows that $H(X/y_j)$ is the uncertainty in the
posterior distribution obtained from Bayes rule after observing one
particular experimental outcome $y_j$. Since there are m
possible posterior distributions, it is reasonable to define H(X/Y)
as the average uncertainty over all posterior distributions.

It is customary and intuitively pleasing to define a decrease
in uncertainty (entropy) as an increase in information, or
$I = \Delta H = H_p - H_q$. This allows the amount of information gathered
to be measured in bits. In this sense then, R = H(X) - H(X/Y) is the
measure of information provided by the experiment. From (4-4), this
information can never be negative. Each time the experiment is
performed R bits of information (on the average) are acquired. If the
experiment is very good, H(X/Y) = 0 and the posterior distribution has
no uncertainty. Here R = H(X) and all the uncertainty in the prior
distribution has been removed by the experiment. If the experiment is
very poor, then H(X/Y) = H(X) and no information has been provided by
the experiment. In this case R = 0.

Of course the amount of information which can be provided by an
experiment is limited by the amount of uncertainty contained in the
prior distribution. Thus for a given prior distribution, the best
experiment is the one with the largest value of R. To compare
experiments in decision problems with different prior distributions it is
convenient to define a dimensionless figure of merit

$$a = \frac{R}{H(X)}$$

where $0 \le a \le 1$. PCT = 100a is the percent of uncertainty in the
prior distribution which is resolved by the experiment. PCT = 100
implies a perfect experiment and PCT = 0 implies a worthless
experiment.

Relation (4-4) states that the goodness of an experiment can
also be measured by R = H(Y) - H(Y/X). Here H(Y) is a function of
the experiment alone. H(Y/X) is the average uncertainty in Y, if X
is known beforehand. R = H(Y) - H(Y/X) is the amount of information
about Y which is acquired from knowing X. This expresses an
information balance. The amount of information contained about X in Y
is equal to the amount of information about Y in X.

From (4-4) it is clear then that the goodness of an experiment
can be inferred from either the average amount of information provided
by the experiment as to the state of nature, or the average amount of
information provided by the state of nature as to the outcome of the
experiment. This is simply the strength of the statistical dependence
between cause and effect, or effect and cause. From (4-5) and (4-2),
if cause and effect are statistically independent, R = 0.

The interpretation of cause and effect relationships is
discussed in depth by Watanabe. His conclusions regarding
interpretation of entropy expressions are similar to those presented here. He
defines the inferential process of looking ahead from a known state of
nature to the uncertain outcome of an experiment as being prediction,
and looking backward from a known experimental outcome to the uncertain
state of nature as being retrodiction.

4.52 Communication Theory Interpretation

The decision theory interpretation of entropy reduction by
performing an experiment is not the customary way to interpret relations
(4-1) through (4-5). Communication engineers prefer to interpret the
same results in terms of an information (or symbol) transmitter, a
noisy channel, and a receiver, as shown below.

Here discrete symbols are drawn randomly from a probability
distribution p(X) having entropy H(X), and are transmitted
sequentially (as drawn) through a noisy channel. A distorted message is
received, where distortion implies that some of the symbols are changed
by the noise into different symbols. A correcting device attempts to
infer what symbol was sent, on the basis of what symbol is received.
H(X/Y) is the residual entropy associated with the message received
after the correcting device has 'cleaned up' the noisy message. H(X/Y)
is referred to as the equivocation of the channel with respect to the
source distribution p(X). It represents the amount of information
lost (not recoverable by the correcting device) in the channel.
R = H(X) - H(X/Y) is the amount of information transmitted through the
noisy channel.

Both the decision theory and the communications theory
interpretation of information theoretic expressions have merit, depending
on the problem at hand.

4.53 Computation of an Information Statistic $\hat{R}$

For computational purposes, consider a decision problem with
two states of nature, and an associated experiment with two outcomes.
After observing the true states of nature and the corresponding
experimental outcome for several trials, it is possible to summarize the
observations in the sample contingency table of integers shown below.

$$
\begin{array}{c|cc|c}
      & y_1    & y_2    &        \\ \hline
x_1   & n_{11} & n_{12} & n_{1.} \\
x_2   & n_{21} & n_{22} & n_{2.} \\ \hline
      & n_{.1} & n_{.2} & N
\end{array}
\qquad \text{(4-6)}
$$

There is a large body of literature which deals with the statistical
theory of contingency tables. See for example Kullback. However,
(4-6) above will be considered here simply as a convenient tabular
data array. Data in (4-6) will be used to estimate R.


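Let $\hat{R}$ be a sample estimate of R based on the observations
in (4-6). $\hat{R}$ will henceforth be called the information statistic. It
can be computed directly from either (4-4) or (4-5). However, it is
easy to derive a more convenient computational form. To do this, first
define a contingency table of probability estimates (the joint
distribution p(X,Y)) as follows:

$$
\begin{array}{c|cc|c}
      & y_1             & y_2            &                 \\ \hline
x_1   & \alpha          & \beta          & \alpha + \beta  \\
x_2   & \gamma          & \delta         & \gamma + \delta \\ \hline
      & \alpha + \gamma & \beta + \delta & 1.0
\end{array}
\qquad \text{(4-7)}
$$

where $\alpha = n_{11}/N$, $\beta = n_{12}/N$, $\gamma = n_{21}/N$, $\delta = n_{22}/N$.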


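Then:

$$
\begin{aligned}
\hat{R} &= \hat{H}(X) + \hat{H}(Y) - \hat{H}(X,Y) \qquad \text{(4-5)} \\
        &= -(\alpha + \beta) \log(\alpha + \beta) - (\gamma + \delta) \log(\gamma + \delta) - (\alpha + \gamma) \log(\alpha + \gamma) \\
        &\quad - (\beta + \delta) \log(\beta + \delta) + \alpha \log \alpha + \beta \log \beta + \gamma \log \gamma + \delta \log \delta.
\end{aligned}
$$

Collecting all terms in $\alpha$, $\beta$, $\gamma$, and $\delta$ gives:

$$
\begin{aligned}
\hat{R} &= \alpha[-\log(\alpha + \beta) - \log(\alpha + \gamma) + \log \alpha] \\
        &\quad + \beta[-\log(\alpha + \beta) - \log(\beta + \delta) + \log \beta] \\
        &\quad + \gamma[-\log(\alpha + \gamma) - \log(\gamma + \delta) + \log \gamma] \\
        &\quad + \delta[-\log(\beta + \delta) - \log(\gamma + \delta) + \log \delta] \\
        &= \alpha \log \frac{\alpha}{(\alpha + \beta)(\alpha + \gamma)} + \beta \log \frac{\beta}{(\alpha + \beta)(\beta + \delta)} \\
        &\quad + \gamma \log \frac{\gamma}{(\alpha + \gamma)(\gamma + \delta)} + \delta \log \frac{\delta}{(\beta + \delta)(\gamma + \delta)}
\end{aligned}
$$

or, in terms of the integer counts,

$$
\begin{aligned}
N\hat{R} &= n_{11} \log \frac{N n_{11}}{(n_{11} + n_{12})(n_{11} + n_{21})} + n_{12} \log \frac{N n_{12}}{(n_{11} + n_{12})(n_{12} + n_{22})} \\
         &\quad + n_{21} \log \frac{N n_{21}}{(n_{11} + n_{21})(n_{21} + n_{22})} + n_{22} \log \frac{N n_{22}}{(n_{12} + n_{22})(n_{21} + n_{22})}.
\end{aligned}
$$

Since $n_{i.} = \sum_{j=1}^{2} n_{ij}$ and $n_{.j} = \sum_{i=1}^{2} n_{ij}$, we get:

$$N\hat{R} = \sum_{i=1}^{2} \sum_{j=1}^{2} n_{ij} \log \frac{N n_{ij}}{n_{i.}\, n_{.j}} \qquad \text{(4-8)}$$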


This gives a convenient computational form for the information
statistic $\hat{R}$. However, when $a = \hat{R}/\hat{H}(X)$ is to be computed, direct use
of (4-4) is recommended, since $\hat{H}(X)$ is produced as a byproduct.

If $\hat{R}$ is the estimated number of bits of information (on the
average) which are provided each time the experiment is performed, then
$N\hat{R}$ is the total number of bits of information provided by all the N
replications of the experiment.

There is another interpretation of the information statistic
based on (4-8). Suppose the sample contingency table arises from
comparing a (0/1) vector x (two states of nature, zero and one) with a
(0/1) experimental outcome vector y (two experimental outcomes, zero
and one). The similarity of vectors x and y is intuitively high if
$x_i = y_i = 0$ or 1 for a large number of indices i. Of the four terms
in the expression (4-8), two involve $n_{ij}$ on the main diagonal of the
table, and two involve $n_{ij}$ off the diagonal. The sum of the diagonal
terms of (4-8) represents the measure of similarity between the vectors
x and y, while the sum of the off-diagonal terms is a measure of
their dissimilarity.
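
The computational form (4-8) translates directly into code. In the
sketch below the counts are those implied by the introductory example
of Section 4.1 (15 relevant and 5 nonrelevant documents with the term,
5 relevant and 25 nonrelevant without it); the function name and the
treatment of empty cells as contributing zero are assumptions.

```python
from math import log2


def information_statistic(table):
    """R-hat in bits from a contingency table of counts n_ij, using (4-8).

    Cells with n_ij = 0 are skipped, following the convention 0 log 0 = 0.
    """
    row = [sum(r) for r in table]            # n_i.
    col = [sum(c) for c in zip(*table)]      # n_.j
    n = sum(row)                             # N
    total = sum(
        n_ij * log2(n * n_ij / (row[i] * col[j]))
        for i, r in enumerate(table)
        for j, n_ij in enumerate(r)
        if n_ij > 0
    )
    return total / n


# Rows: relevant, nonrelevant. Columns: term present, term absent.
example = [[15, 5],
           [5, 25]]
r_hat = information_statistic(example)

# A table whose rows and columns are independent carries no information.
r_independent = information_statistic([[10, 10], [10, 10]])
```

For the example table the statistic is about 0.256 bits per document,
or about 12.8 bits in total over the 50 documents.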


4.54 Statistical Distribution of the Information Statistic

Since $\hat{R}$ is a statistic drawn from a sample, it can be
expected to behave as a random variable. It is known(71) that

$$[\log_e 2]\, 2N\hat{R}$$

is asymptotically distributed as a central chi-squared variable with
one degree of freedom (for a 2 x 2 sample contingency table) under the
null hypothesis that R = 0. The factor $\log_e 2 = 0.693$ is needed
because $\hat{R}$ is assumed to have the units of bits in (4-8).
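
A significance check based on this result might be sketched as follows.
The values $\hat{R} = 0.06593$ and N = 28 are those reported for term 1
in Fig. 4-2C; the 5% significance level, its critical value of 3.841
for one degree of freedom, and the function name are assumptions made
only for this illustration.

```python
from math import log

# Upper 5% point of the chi-squared distribution with one degree of freedom.
CHI2_1DF_5PCT = 3.841


def chi_squared_from_bits(r_hat_bits, n):
    """Test statistic [log_e 2] * 2 * N * R-hat, with R-hat measured in bits."""
    return 2.0 * n * r_hat_bits * log(2.0)


stat = chi_squared_from_bits(0.06593, 28)
significant = stat > CHI2_1DF_5PCT
```

The statistic is about 2.56, below the assumed critical value, so with
only 28 documents the null hypothesis R = 0 would not be rejected at
that level.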

4.55 Example Problem

As an example, consider a training set of 28 documents. A set
of 155 index terms were found with this document set. An estimate of
the information provided about document relevance by two of these
terms will be made to illustrate previous results.

Vector $x = (x_i)$, $i = 1, 2, \ldots, 28$, of Fig. 4-2A shows the correct
classification of each of the 28 documents in the training set, with
$x_i = 1$ if document i is relevant. Vectors $T_1 = (t_{i1})$ and
$T_2 = (t_{i2})$ of Fig. 4-2A show how terms 1 and 2 are used to index the 28
documents. For example, if $t_{i1} = 1$, then index term 1 is used to
index document i.

It is possible to compare the effectiveness of terms 1 and 2
as relevance indicators (over the training set) by comparing vectors
$T_1$ and $T_2$ separately with vector x. Fig. 4-2B shows the results
of these comparisons expressed as 2 x 2 contingency tables.
Calculations leading to $a_1$ and $a_2$ are detailed in Fig. 4-2C. Equation
(4-4) is used for $\hat{R}$ instead of (4-8) because $\hat{H}(X)$ is generated as
a byproduct with (4-4), and $\hat{H}(X)$ is required for $a = \hat{R}/\hat{H}(X)$. Fig.
4-2C shows the estimated marginal and conditional distributions and
their corresponding entropies. It can be seen that term 2 ($a_2$ =
0.0780) is estimated to be slightly better than term 1 ($a_1$ = 0.0701).
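
The figures of merit quoted above can be recomputed from the
contingency tables of Fig. 4-2B using relation (4-4). The sketch below
is an illustration only; the function name `figure_of_merit` and its
return layout are not from the thesis.

```python
from math import log2


def H(probs):
    """Entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)


def figure_of_merit(table):
    """Return (H(X), H(X/T), R, a) for counts table[x][t], using (4-4)."""
    n = sum(map(sum, table))
    h_x = H([sum(row) / n for row in table])
    h_x_given_t = 0.0
    for col in zip(*table):                  # one column per term value t
        n_t = sum(col)
        if n_t:
            h_x_given_t += (n_t / n) * H([c / n_t for c in col])
    r = h_x - h_x_given_t
    return h_x, h_x_given_t, r, r / h_x


# Counts from Fig. 4-2B: rows x = 0, 1; columns t = 0, 1.
term1 = [[9, 9], [8, 2]]
term2 = [[15, 3], [10, 0]]

h_x, h1, r1, a1 = figure_of_merit(term1)
_, h2, r2, a2 = figure_of_merit(term2)
```

The recomputed values agree with Fig. 4-2C to about four decimal
places, and they preserve the ranking of term 2 above term 1.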
FIGURE 4-2

EXAMPLES ILLUSTRATING COMPUTATION OF AN INFORMATION STATISTIC FOR
ESTIMATING INFORMATION ABOUT DOCUMENT RELEVANCE CONVEYED BY INDEX TERMS

A. Vectors for Comparison

  i    x_i   t_i1   t_i2
  1     …     …      …
  2     …     …      …
  3     …     …      …
  4     …     …      …
  5     0     0      0
  6     1     0      0
  7     …     …      …
  8     0     1      0
  9     0     0      1
 10     1     0      0
 11     0     0      0
 12     1     0      0
 13     1     0      0
 14     0     0      0
 15     0     0      0
 16     0     1      0
 17     0     0      0
 18     0     0      1
 19     0     1      0
 20     0     0      0
 21     0     1      0
 22     1     1      0
 23     1     0      0
 24     0     1      0
 25     1     0      0
 26     0     1      0
 27     0     1      0
 28     0     1      0

B. Contingency Tables for Comparing x with T_1 and T_2

            t_i1 = 0   t_i1 = 1
x_i = 0         9          9        18
x_i = 1         8          2        10
               17         11        28

            t_i2 = 0   t_i2 = 1
x_i = 0        15          3        18
x_i = 1        10          0        10
               25          3        28

C. Computations*

                    x with T_1              x with T_2
p(X)                (0.64286, 0.35714)      (0.64286, 0.35714)
H(X)                0.94027                 0.94027
p(X/t_i=0)          (0.52942, 0.47058)      (0.600, 0.400)
H(X/t_i=0)          0.99749                 0.97096
p(X/t_i=1)          (0.81818, 0.18182)      (1.000, 0.000)
H(X/t_i=1)          0.68402                 0.00000
p(T)                (0.60714, 0.39286)      (0.89286, 0.10714)
H(X/T)              0.8743                  0.86694
R = H(X) - H(X/T)   0.06593                 0.07333
a = R/H(X)          0.0701                  0.0780

*p(·) is the probability distribution and
 H(·) is the distribution entropy






















5.0 SOLVING THE DISCRETE LINEAR APPROXIMATION PROBLEM IN THE $L_1$ NORM

5.1 Introduction

The discrete linear approximation model can be written as
follows:

$$y = X\beta + r.$$

The model can also be written as

$$y_i = \beta_0 + \sum_{j=1}^{n-1} x_{ij}\beta_j + r_i = \sum_{j=0}^{n-1} x_{ij}\beta_j + r_i, \quad i = 1, 2, \ldots, m,$$

with $x_{i0} = 1$.
The linear approximation problem arises when estimates of the unknown
vector $\beta$ are desired. We define a best estimate of $\beta$ to be the
vector $b^*$ which minimizes the length of the residual vector r. If
we designate the length of the vector r by $\|r\|$, called the norm
of r, then our approximation problem becomes:

Find $b^*$ such that

$$\|r^*\| = \min_{b} \|r\| = \min_{b} \|y - Xb\|.$$

A class of norms is given by

$$L_p(r) = \|r\|_p = \Big( \sum_{i=1}^{m} |r_i|^p \Big)^{1/p} \quad \text{for } 1 \le p \le \infty.$$
When p = 2 the familiar least squares problem results. The cases where p = 1 and p = ∞ are also of practical interest because algorithms are available to compute $b^*$. In particular, they may be formulated as linear programming problems and be easily solved.

$L_1(r)$, corresponding to p = 1, gives a fit which minimizes the sum of the absolute values of the residuals $r_i$.

$L_\infty(r)$, corresponding to the limiting case $L_\infty(r) = \lim_{p \to \infty} L_p(r) = \max_i |r_i|$, gives a fit which minimizes the largest residual (in absolute value). The $L_\infty$ norm is also often called the uniform or Chebyshev norm.

The $L_1$ and $L_\infty$ solutions will always exist when computed using the linear programming formulation, even when the rank of X is q < n. This makes the $L_1$ and $L_\infty$ norms attractive when dealing with data matrices which are not known beforehand to have rank q = n. The $L_2$ (least squares) normal equations do not have a unique solution when q < n.

For the application considered here, the approximation problem arises when index term 'weights' are to be derived for estimating document utility. The matrix X is not known beforehand to have rank q = n. The $L_1$ norm is used here to estimate the index term weights, and no problem is encountered if q < n. In addition the solution is very rapidly and conveniently attained with the linear programming formulation. Formulation of the $L_1$ problem as a linear program is briefly reviewed below. Example problems are used to illustrate the development.
5.2 Formulating the Discrete Problem as a Linear Programming Problem

Formulation of the $L_1$ problem as a linear programming problem has been shown by I. Barrodale [reference numbers illegible] and P. Rabinowitz (16). The formulation proceeds as follows: let

$$y = X\beta + r.$$

Now, since $\beta$ and $r$ are unrestricted in sign, they can each be expressed as the difference between two non-negative vectors, i.e.

$$\beta = \beta^+ - \beta^-, \qquad r = r^+ - r^-, \qquad \beta^+, \beta^-, r^+, r^- \ge 0,$$

so that

$$(X \mid -X \mid I \mid -I)\begin{pmatrix}\beta^+ \\ \beta^- \\ r^+ \\ r^-\end{pmatrix} = y.$$

These equations can be regarded as the constraint set for a linear programming problem. The unknowns are the vectors $\beta^+$, $\beta^-$, $r^+$ and $r^-$. The distinction made in section 5.1 between the unknown vector $\beta$ and its optimal estimate $b^*$ has been dropped here to eliminate notational complexity. All vectors $\beta$ appearing as the unknowns in LP problems are to be considered estimates of the true vectors.

The objective function can be formulated by observing that unit vectors corresponding to $r_i^+$ and $r_i^-$ will never be in the basis at the same time, since they are linearly dependent (the same remarks apply to $\beta_j^+$ and $\beta_j^-$; see Hadley). The solution variable $r_i^+ + r_i^-$ then represents the absolute value of the ith residual, since:

either $r_i^+ = |r_i|$ and $r_i^- = 0$,

or $r_i^- = |r_i|$ and $r_i^+ = 0$.


By putting zero costs in for the unknowns $\beta_j^+$ and $\beta_j^-$ and unit costs in for the unknowns $r_i^+$ and $r_i^-$, the sum of the absolute values of the residuals is minimized. This gives the linear programming problem shown below.

$$\text{minimize } z = \sum_{j=1}^{n} 0 \cdot \beta_j^+ + \sum_{j=1}^{n} 0 \cdot \beta_j^- + \sum_{i=1}^{m} r_i^+ + \sum_{i=1}^{m} r_i^-$$

$$\text{subject to } (X \mid -X \mid I \mid -I)\begin{pmatrix}\beta^+ \\ \beta^- \\ r^+ \\ r^-\end{pmatrix} = y, \qquad \beta^+, \beta^-, r^+, r^- \ge 0. \qquad (5\text{-}1)$$

After solving the problem, form $\beta = \beta^+ - \beta^-$ and $r = r^+ - r^-$ to recover estimates of the parameter and residual vectors. The optimal value of the objective function is the minimal $L_1$ norm.
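
The structure of formulation (5-1) can be checked numerically. The sketch below, in Python, builds the structural matrix A = (X | -X | I | -I) and the cost vector, then confirms that splitting any trial parameter vector and its residuals into non-negative parts gives a feasible point whose cost equals the sum of absolute residuals. It is an illustration only: the data set, the trial vector `beta` and the helper names `split` and `build_lp` are hypothetical, and no simplex solver is included.

```python
def split(v):
    """Split each entry into non-negative parts so that v = pos - neg."""
    pos = [max(x, 0.0) for x in v]
    neg = [max(-x, 0.0) for x in v]
    return pos, neg

def build_lp(X, y):
    """Return (A, b, c) for formulation (5-1), with A = (X | -X | I | -I)."""
    m, n = len(X), len(X[0])
    A = []
    for i in range(m):
        unit = [1.0 if k == i else 0.0 for k in range(m)]
        A.append([float(x) for x in X[i]] + [-float(x) for x in X[i]]
                 + unit + [-e for e in unit])
    c = [0.0] * (2 * n) + [1.0] * (2 * m)  # zero cost on beta, unit cost on r
    return A, [float(v) for v in y], c

# Hypothetical data: 4 documents, 3 columns (the first column is the constant term).
X = [[1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0]]
y = [2.0, -1.0, 0.0, 1.0]
beta = [1.0, -1.5, 0.5]  # an arbitrary trial vector, not the optimum

resid = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
A, b, c = build_lp(X, y)
beta_pos, beta_neg = split(beta)
r_pos, r_neg = split(resid)
v = beta_pos + beta_neg + r_pos + r_neg  # a feasible point of (5-1)

lhs = [sum(a * vi for a, vi in zip(row, v)) for row in A]
cost = sum(ci * vi for ci, vi in zip(c, v))
l1_norm = sum(abs(r) for r in resid)
print(lhs, cost, l1_norm)
```

Any LP code that minimizes c·v subject to Av = b, v ≥ 0 could then be applied to `A`, `b` and `c`.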

The size of the constraint set in (5-1) is m rows by (2m + 2n) columns. By transforming some of the variables, Barrodale shows that (n - 1) columns of the constraint matrix can be eliminated. To see this, let $y = X\beta + r$, or

$$y_i = \sum_{j=1}^{n} \beta_j x_{ij} + r_i,$$

as before. Now instead of writing the unrestricted $\beta_j$ as the difference of two non-negative components as before, define

$$\bar{x}_i = \sum_{j=1}^{n} x_{ij}$$

and let

$$u = \max_j |\beta_j|.$$

Then

$$-u \le \beta_j, \qquad \beta_j + u \ge 0.$$

Finally define

$$a_j = \beta_j + u,$$

which gives

$$y_i = \sum_{j=1}^{n} a_j x_{ij} - u\bar{x}_i + r_i^+ - r_i^-$$

for the constraint set. The complete problem becomes:

$$\text{minimize } z = \sum_{j=1}^{n} 0 \cdot a_j + 0 \cdot u + \sum_{i=1}^{m} 1 \cdot r_i^+ + \sum_{i=1}^{m} 1 \cdot r_i^-$$

$$\text{subject to } (X \mid -\bar{x} \mid I \mid -I)\begin{pmatrix} a \\ u \\ r^+ \\ r^- \end{pmatrix} = y, \qquad a, u, r^+, r^- \ge 0. \qquad (5\text{-}2)$$

The vector $-\bar{x}$ replaced the submatrix $-X$ in the constraint matrix for a net savings of n - 1 columns.

Now solve (5-2) for $a$, $u$, $r^+$, $r^-$. Then $r = r^+ - r^-$ gives the residuals, while the parameter estimates are given by $\beta_j = a_j - u$.

The length of the residual vector (in the $L_1$ sense) is given by the optimal value of the objective function, as before.
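
The substitution behind (5-2) can be verified on a small case. The Python sketch below assumes that u is taken as the largest absolute parameter value (any u large enough to make every $a_j = \beta_j + u$ non-negative would serve) and that the replacement column holds the row sums of X. The data and the name `reduce_parameters` are hypothetical.

```python
def reduce_parameters(beta):
    """Shift all parameters by a common u so that every a_j = beta_j + u >= 0."""
    u = max(abs(b) for b in beta)
    a = [b + u for b in beta]
    return a, u

# Hypothetical data: 4 documents (m = 4), 3 columns (n = 3).
X = [[1, 0, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0]]
beta = [1.0, -1.5, 0.5]
m, n = len(X), len(X[0])

a, u = reduce_parameters(beta)
x_bar = [sum(row) for row in X]  # the single column that replaces -X

fitted_direct = [sum(x * b for x, b in zip(row, beta)) for row in X]
fitted_reduced = [sum(x * aj for x, aj in zip(row, a)) - u * xb
                  for row, xb in zip(X, x_bar)]

columns_5_1 = 2 * n + 2 * m      # (X | -X | I | -I)
columns_5_2 = n + 1 + 2 * m      # (X | -x_bar | I | -I)
recovered = [aj - u for aj in a]  # beta_j = a_j - u
print(fitted_direct, fitted_reduced, columns_5_1 - columns_5_2)
```

Both parameterizations give the same fitted values, so the residuals and the objective are unchanged while n - 1 columns are saved.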

Two comments can be made which apply to either (5-1) or (5-2). The LP problem has no Phase I. Because a unit matrix exists in the constraint matrix, there is an initial basic feasible solution. Since the objective function is also bounded below by zero, this implies that there is always an optimal basic feasible solution. Furthermore, the existence of this solution does not depend upon the rank of the matrix X.

Alternate optimal solutions may exist. More will be said about this later.
5.3 Solving the Problem

The problem of determining index term weights was set up and solved using (5-1) instead of (5-2). Although (5-2) is more efficient, it was unknown to the author at the time the computer programming was done.

The approximation problem is solved here using three subroutines, one of which is a general purpose SIMPLEX routine. (Barrodale has developed one specialized routine for the $L_1$ problem.) A Fortran IV subroutine for linear programming written by R. J. Clasen is used to solve the LP problem. A driver subroutine loads the structural matrix A using the data matrix X, loads the right hand side vector b using the known dependent variable vector y, and finally loads the cost vector c', which depends only on the structure of the problem and not on the data.

After the A, b, c data have been loaded by the driver subroutine, the resulting LP problem is solved using the Clasen subroutine. The solution to the LP problem is related to the solution of the approximation problem by using a follower, or interpretive subroutine, which recovers the unrestricted (as to sign) variables $\beta_j$ from the optimal non-negative solution variables $a_j$ and $u$ of the LP problem.

Computational experience with the solution of $L_1$ problems for index term weights has shown that the program is quite fast. For typical problems having 25 rows and 72 columns the average solution time was 3.0 seconds, while for larger problems with 50 rows and 122 columns, the average solution time was 6.0 seconds. This is for the IBM 7094/7044 direct coupled system.


5.4 Example Problems

Figure 5-1A shows the initial full simplex tableau which results when the problem presented as an example in section 3.34 is set up as an LP problem using formulation (5-1). The submatrix X of Fig. 5-1A is the same as the matrix Z of Fig. 3-2, except that the columns of Z have been permuted to form X. This does not affect the problem solution in any way. This same permuted version of Z also appears as matrix X of Fig. 5-2A and Fig. 5-3A. To identify columns of X with columns of Z the following table is convenient:

  Variables                    [row illegible]
  Column number, X             1    2    3    4    5    6
  Cross references, Z          1    4    6    5    2    3



Figure 5-1B shows the optimal tableau for this problem, and Fig. 5-1C gives the solution (an LUPF whose legible final terms are $\ldots + 1T_4 - 1T_5$), which is reconstructed from the optimal LP solution.

The optimal tableau of Fig. 5-1B indicates that an alternate optimal solution is present. Columns indicated with an asterisk are in the optimal basis, while columns paired with the basis columns are marked with 'P'. (Recall that all columns in the structural matrix A have a paired column of the opposite sign in formulation (5-1).) Columns not in the optimal basis but having their associated $(c_j - z_j) = 0$ (neglecting columns marked with P) indicate that an alternate optimal solution can be attained with column 7 (a $\beta^-$ column) in the basis and column 9 (another $\beta^-$ column) out of the basis. Figure 5-2A shows the tableau for this alternate optimal solution. Note that the solution parameters have changed and the LUPF is different.

Figure 5-3 shows the same problem solved using formulation (5-2). The optimal solution is the same as that given in Figure 5-2 using formulation (5-1).

FIGURE 5-1

SAMPLE L_1 PROBLEM - FORMULATION (5-1)

A. Initial Tableau Showing Input Data; A = (X|-X|I|-I) = structural matrix
B. Optimal Tableau
C. Solution Interpretation

Note: columns in the optimal basis are indicated with an asterisk. Columns out of the basis, but "paired" to columns in the basis, are indicated with the letter 'P'.

[Tableaux not reproduced.]

FIGURE 5-2

SAMPLE PROBLEM - FORMULATION (5-1) SHOWING ALTERNATE OPTIMAL TABLEAU

A. Alternate Optimal Tableau

[Tableau not reproduced.]


5.5 The Effects of Alternate Optima 


The appearance of alternate optimal solutions to the $L_1$ approximation problem very simply means that we should be indifferent to the effects of using different estimated LUPF's which might arise from the alternate optima.

Each optimal LUPF gives the same 'best' fit to the user assigned utilities in the training set, in the sense that $\sum_i |r_i|$ is the same for each LUPF.

A search of the rest of the file with a different LUPF will undoubtedly yield different results, but without using extra information to eliminate the alternate optima, one optimal LUPF is as good as any other. The use of extra information to limit alternate optimal solutions is suggested in chapter 9 as an extension of the present system which might be investigated as a future research problem.
Figure 5-4 gives an example of the different utilities which would be predicted for the various term combinations when two alternate optimal solutions are compared. All 32 combinations of five index terms are listed in Fig. 5-4. (Term $T_0$ is fixed at $T_0 = 1$ and hence does not affect the number of combinations.) The utilities which were assigned for the term combinations corresponding to the eight documents in the training set are shown separately. These combinations are numbered 2, 3, 13, 21, 25, 27, 29. Note that two different documents were in the training set with the same index term combination (combination 25). The assigned utilities were different for the two documents (one was relevant, the other was not). Solutions 1 and 2 of Fig. 5-4 show the LUPF's which correspond to the alternate optimal LP solutions illustrated previously in Figs. 5-1 and 5-2. Each of these solutions provides a 'best' (but different) fit to the training set utilities. They also provide different utility predictions for documents outside the training set. In some cases differences in the predicted utilities cause the predicted document relevance category ($u \ge \tau = 0$) to differ. For example, the term combinations 4, 8, 15, 16, 22, 28, 32 are predicted relevant using solution 1 but non-relevant using solution 2. Combination 17 is predicted non-relevant under solution 1 but relevant under solution 2.
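
How two equally good $L_1$ fits can disagree on relevance may be seen in a deliberately tiny case. The Python sketch below is a toy illustration and does not use the data of Fig. 5-4: the training set, the one-term LUPF u = b0 + b1·T and the candidate weights are all hypothetical. Two documents share the same term value but carry different utilities, as with combination 25 above.

```python
def l1_error(b0, b1, training_set):
    """Sum of absolute residuals of the one-term LUPF u = b0 + b1 * T."""
    return sum(abs(u - (b0 + b1 * t)) for t, u in training_set)

# (term value T, assigned utility u); the last two documents conflict.
training_set = [(0, 0.0), (1, 1.0), (1, -1.0)]
tau = 0.0  # utility threshold

lupf_1 = (0.0, 0.5)
lupf_2 = (0.0, -0.5)
err_1 = l1_error(*lupf_1, training_set)
err_2 = l1_error(*lupf_2, training_set)

# Coarse grid search confirming that neither fit can be improved upon.
grid = [k / 4 for k in range(-8, 9)]
best = min(l1_error(b0, b1, training_set) for b0 in grid for b1 in grid)

# Predicted relevance of a document having T = 1 under each LUPF.
relevant_1 = lupf_1[0] + lupf_1[1] * 1 >= tau
relevant_2 = lupf_2[0] + lupf_2[1] * 1 >= tau
print(err_1, err_2, best, relevant_1, relevant_2)
```

Both weight vectors attain the minimum error, yet they place a document with T = 1 in opposite relevance categories.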
5.6 Secondary Feature Extraction

By referring to Figure 5-1A, note that the submatrix X has six columns. Each of these six columns represents a possible term in the LUPF. Five of these columns represent specific index terms which had been previously selected using the information measure introduced in an earlier chapter.

The optimal tableau shown in Fig. 5-1B indicates that only four (out of a possible six) columns of X (or -X) are in the optimal basis. Four (out of a possible five) index terms have been assigned to the LUPF shown in Fig. 5-1C. A secondary index term selection has taken place.

This secondary term selection (or feature extraction) process has the effect of automatically discarding from the basis those index terms (columns) which are linearly dependent on other terms in the basis.

If the least squares solution were used instead, the linearly dependent columns of X would have to be eliminated before solving the normal equations. The $L_1$ formulation here eliminates this extra operation.


5.7 More Efficient Algorithms

It can be noted that the parameter vector $b$ obtained with the $L_1$ norm configuration has elements which are integral multiples of 1/2, i.e., $b_j = \pm n/2$. This effect is obviously dependent on properties of the inverses of matrices whose elements are all +1, -1 or zero, and on the integral properties of the right hand side vector (the utilities).

The properties of $b$ suggest that perhaps the LP problem for this type of matrix can be solved with a transportation or network type of algorithm. Investigation of this was outside the scope of this work.
6.0 DETERMINATION OF THE OPTIMAL BRS
6.1 Scope and Organization 

The optimal BRS is a set of searching instructions which retrieves from a file only those documents having a predicted utility greater than or equal to a given utility threshold.

The optimal BRS is derived from the LPBI which is formed by thresholding the document LUPF.

This chapter discusses mathematical properties of the LPBI and of its solutions. A composite algorithm is presented which finds all the solutions to the LPBI and groups these into solution families which are mutually disjoint. This composite algorithm is based on visiting the nodes of a binary tree in search of possible solutions to the inequality. It is called the Tree Pruning Algorithm (TPA), and uses a branch-and-exclude technique which allows all solutions to be found without constructing or exploring the entire binary solution tree.

The composite TPA can be broken down into two parts. The first part is a node-visiting sub-algorithm. Here decisions are made (after visiting a tree node) about which nodes of the tree to exclude from future visits. The second part of the TPA is a visit-scheduling sub-algorithm which controls the sequencing of node visits. This sub-algorithm guarantees that each non-excluded node is visited once and only once in a defined order. It also keeps node records necessary for use by the node-visiting sub-algorithm. The visit-scheduling sub-algorithm is necessary to implement the TPA on a digital computer.
The concepts and theory pertinent to solving a LPBI by a node-visiting method have been given elsewhere by Hammer and Rudeanu. Most of the mathematical details presented here are also from these references. An exception is section 6.323. Here some proofs are presented which are related to transformations used to solve the LPBI. These proofs are not given by Hammer and Rudeanu. Background theoretical results and details of the node-visiting sub-algorithm are presented in the first part of this chapter, up to and including section 6.5.

The visit-scheduling sub-algorithm is the author's contribution to the TPA. It is a modified form of a pre-order traversal algorithm for binary trees. This sub-algorithm allows dynamic visit-scheduling as portions of the binary tree are sequentially excluded from further consideration. Development of this sub-algorithm begins in section 6.6.

The operation of the composite TPA is illustrated with examples, and computational experience with a Fortran IV program is discussed.

The use of the LPBI solution families to retrieve documents is discussed near the end of the chapter.


6.2 The LPBI Arising from the Document LUPF

It is assumed that a LUPF exists which adequately expresses the utility of documents in the file as a linear combination of selected index term weights, i.e.

$$u = \sum_{j=0}^{n} a_j T_j, \qquad T_j \in \{0,1\}, \qquad -\infty < a_j < \infty,$$

(with $T_0 = 1$), which becomes a pseudo-Boolean inequality when thresholded:

$$\sum_{j=1}^{n} a_j T_j \ge (\tau - a_0), \qquad -\infty < \tau < \infty.$$


After conversion of the coefficients $a_j$ and the right hand side $(\tau - a_0)$ to integers $\gamma_j$ and $\delta$ by a scaling and truncating process,

$$\sum_{j=1}^{n} \gamma_j T_j \ge \delta, \qquad T_j \in \{0,1\}, \qquad \delta, \gamma_j \in \{I\}, \qquad (6\text{-}1)$$

where I is the set of all integers.

For all further results in this chapter the LPBI will be assumed to have integer coefficients. This represents no loss of generality, because by scaling all coefficients and the right hand side, and then dropping the fractional parts, if any, the coefficients can be converted to integers with any desired degree of accuracy.
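
The scaling and truncating step might be sketched as follows in Python. This is an assumption-laden illustration: the weights, the threshold, the scale factor and the function names are hypothetical, and truncation is taken to mean dropping the fractional part toward zero. When the weights are exact multiples of 1/2, a scale factor of 2 loses nothing; with other weights a larger factor trades accuracy for coefficient size.

```python
from math import trunc

def to_integer_lpbi(weights, rhs, scale):
    """Scale the coefficients and right hand side, then drop fractional parts."""
    gammas = [trunc(w * scale) for w in weights]
    delta = trunc(rhs * scale)
    return gammas, delta

def satisfies(gammas, delta, terms):
    """True when the 0/1 term vector satisfies sum(gamma_j * T_j) >= delta."""
    return sum(g * t for g, t in zip(gammas, terms)) >= delta

# Hypothetical LUPF: u = 0.5 + 0.5*T1 - 1.5*T2 + 1.0*T3, threshold tau = 0.
a0, weights, tau = 0.5, [0.5, -1.5, 1.0], 0.0
gammas, delta = to_integer_lpbi(weights, tau - a0, scale=2)
print(gammas, delta)
print(satisfies(gammas, delta, (0, 1, 0)), satisfies(gammas, delta, (1, 1, 1)))
```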

All solutions of inequality (6-1) are 0/1 vectors $T_k = (T_{k1}, T_{k2}, \ldots, T_{kn})$. There are at most $2^n$ vectors $T_k$ satisfying (6-1). Solution by enumeration is always possible but becomes impractical for all but small problems. Moreover, solution by enumeration does not group solution vectors into families.

Grouping of solution vectors into families is important for two reasons:

(a) one solution family provides a compact mathematical representation of many solution vectors;

(b) the solution families are meaningful in the modeling of document retrieval systems. More will be said about this later in this chapter.

6.3 Properties of the LPBI and Its Solutions

As a prelude to developing an algorithm to solve the inequality (6-1) for all of its solution vectors and/or families of solution vectors, it is necessary to investigate a more general form of (6-1).

6.31 General Form of the LPBI

Let the linear pseudo-Boolean inequality in its general form be defined by:

$$\sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j \ge \delta \qquad (6\text{-}2)$$

where $\alpha_j$, $\gamma_j$ and $\delta$ are given parameters with

$$\alpha_j \in \{0,1\}, \quad j = 1, 2, \ldots, n; \qquad \gamma_j, \delta \in \{I\}, \text{ the set of all integers,}$$

and where $z = (z_j)$ is a solution vector, with $z_j \in \{0,1\}$, $j = 1, 2, \ldots, n$.

The exponents are used to indicate Boolean complements with the following conventions:

$$z_j^0 = \bar{z}_j, \text{ the complement of } z_j; \qquad z_j^1 = z_j; \qquad \bar{z}_j = 1 - z_j. \qquad (6\text{-}3)$$

As a consequence of this exponent notation, note that:

$$z_j^{\alpha_j} = 1 \text{ if } z_j = \alpha_j; \qquad z_j^{\alpha_j} = 0 \text{ if } z_j \ne \alpha_j.$$

The inequality (6-1) arising from the LUPF is equivalent to (6-2) if all $\alpha_j = 1$. The algorithm developed in this chapter will solve form (6-2) of the LPBI.

The adjective pseudo-Boolean implies that while the variables $z_j$ of (6-2) are binary valued, the coefficients are not, and hence the function

$$L(z) = \sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j$$

is a mapping of the binary vector $z$ into the set of positive or negative integers. This is in distinction to a Boolean function $f(z)$ which would map the binary vector $z$ into the binary set {0,1}.

6.32 Canonical Form of the LPBI 

Before solving the inequality (6-2) it is necessary to reduce it to a standard or canonical mathematical form.

The canonical form is defined by

$$\sum_{j=1}^{n} c_j x_j \ge d; \qquad (c_j, d) \in \{I\} \qquad (6\text{-}4)$$

where $x = (x_j)$, $j = 1, \ldots, n$, is the solution vector and $c_1 \ge c_2 \ge \cdots \ge c_n > 0$. This form has all positive coefficients $c_j$, ranked by order of magnitude. In addition, no complemented variables $\bar{x}_j$ appear.


6.321 Transformation of Parameters of the LPBI from the General Form to the Canonical Form. The transformation from (6-2) to (6-4) proceeds in two stages.

First, all negative coefficients are eliminated by the following transformation (and all $\gamma_j$ are relabeled $e_j$):

$$\gamma_j > 0: \quad (e_j \leftarrow \gamma_j; \; a_j \leftarrow \alpha_j)$$

$$\gamma_j < 0: \quad (e_j \leftarrow -\gamma_j; \; a_j \leftarrow \bar{\alpha}_j = 1 - \alpha_j) \qquad (6\text{-}5)$$

$$d \leftarrow \delta - \sum_{(\gamma_j < 0)} \gamma_j$$

where $a \leftarrow b$ is read "a is replaced by b". At this point a new inequality may be defined by:

$$\sum_{j=1}^{n} y_j e_j \ge d \qquad (6\text{-}6)$$

$$y_j \in \{0,1\}, \qquad (e_j, d) \in \{I\}, \qquad e_j > 0,$$

where $y_j$ stands for $z_j^{a_j}$.

The coefficients $e_j$ are next permuted and relabeled so they are in descending order, as specified by (6-4). We define a transformation from $e_j$ to $c_j$ by

$$c_j \leftarrow e_k, \qquad k = P(j), \qquad j = 1, 2, \ldots, n, \qquad (6\text{-}7)$$

where P(j) is a permutation which puts coefficients $e_j$ in descending order. This completes the transformation of parameters to (6-4) from (6-2).

For example, consider a pseudo-Boolean inequality whose parameters consist of:

  j            1    2    3    4    5
  gamma_j     -2   -3    5   -1    2
  alpha_j      1    0    0    1    1

  delta = 0

Eliminating negative coefficients results in new parameters:

  j            1    2    3    4    5
  e_j          2    3    5    1    2
  a_j          0    1    0    0    1

  d = 6

Permuting and relabeling coefficients $e_j$ as $c_j$ gives:

  j            1    2    3    4    5
  c_j          5    3    2    2    1
  P(j)         3    2    1    5    4

  d = 6



The permutation P(j) is obtained from a sort of the $e_j$. If the indices j are sorted along with the $e_j$, the result is P(j). Note that the $\alpha_j$ are transformed into the $a_j$ when the negative coefficients are eliminated. Permuting and relabeling does not modify the $a_j$.
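
The two-stage parameter transformation can be sketched in Python as below. This is an illustrative reading of (6-5) and (6-7), not the author's Fortran code. It assumes the example coefficients are (-2, -3, 5, -1, 2), the first being taken from the worked check of section 6.323, treats a zero coefficient like a positive one, and breaks ties in the sort by original index. The function name `to_canonical` is hypothetical.

```python
def to_canonical(gamma, alpha, delta):
    """Apply (6-5), then sort into descending order as in (6-7).

    Returns e, a, d, the permutation P (1-based) and the canonical c.
    """
    e, a, d = [], [], delta
    for g, al in zip(gamma, alpha):
        if g < 0:
            e.append(-g)       # make the coefficient positive
            a.append(1 - al)   # complement the exponent
            d -= g             # move the constant to the right hand side
        else:
            e.append(g)
            a.append(al)
    # Stable sort of the 1-based indices by descending e_j gives P(j).
    P = sorted(range(1, len(e) + 1), key=lambda j: -e[j - 1])
    c = [e[k - 1] for k in P]  # c_j <- e_{P(j)}
    return e, a, d, P, c

gamma = [-2, -3, 5, -1, 2]
alpha = [1, 0, 0, 1, 1]
delta = 0
e, a, d, P, c = to_canonical(gamma, alpha, delta)
print(e, a, d, P, c)
```

For the example this yields the canonical inequality $5x_1 + 3x_2 + 2x_3 + 2x_4 + x_5 \ge 6$.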

We will be concerned with solutions $x_k = (x_{kj})$ of the canonical form (6-4). The approach is to find solutions to this form, then perform appropriate inverse transformations on these solutions to get vectors $z_k = (z_{kj})$ which satisfy inequality (6-2).


6.322 Transformation of Solutions of the LPBI from the Canonical Form to the General Form. We have defined three inequalities by performing the preceding transformations on the parameters. These are repeated below for comparison.

$$\sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j \ge \delta \qquad (6\text{-}2)$$

$$\sum_{j=1}^{n} y_j e_j \ge d \qquad (6\text{-}6)$$

$$\sum_{j=1}^{n} c_j x_j \ge d \qquad (6\text{-}4)$$

Solutions to (6-4) will be appropriately transformed so they become solutions to (6-6) and finally (6-2). These inverse transformations proceed in two steps, as follows:

(a) from $x$ to $y$ where

$$y_k = x_j, \qquad k = P(j), \qquad j = 1, \ldots, n; \qquad (6\text{-}8)$$

(b) from $y$ to $z$ where

$$a_j = 0 \Rightarrow z_j = 1 - y_j; \qquad a_j = 1 \Rightarrow z_j = y_j;$$

that is: $z_j = y_j^{a_j}$. (6-9)

The transformations defined above can be depicted as a composite mapping [diagram not reproduced] which has as its domain all solutions of the canonical form and as its image set all solutions of the general form (6-2).


6.323 Some Proofs of Results Related to the Transformations. It is easy to prove that a binary vector $z_k = (z_{kj})$ is a solution to inequality (6-2) if and only if the corresponding vector $x_k = (x_{kj})$ is a solution to inequality (6-4) when (6-5) and (6-7) are used to transform the coefficients, and (6-8) and (6-9) are used to transform $x_k$ to $z_k$. That is,

$$\sum_{j=1}^{n} z_{kj}^{\alpha_j}\gamma_j \ge \delta \iff \sum_{j=1}^{n} x_{kj} c_j \ge d. \qquad (6\text{-}10)$$


To show this it is convenient to establish two preliminary results. First, note that we need consider only transformations from (6-6) to (6-2) instead of from (6-4) to (6-2). This is because a solution $x$ to (6-4) is always transformed by (6-8) into a solution $y$ of (6-6). Recall that transformation (6-8) is merely a permutation of coefficients, i.e.

$$x_j c_j = y_{P(j)} e_{P(j)}, \quad j = 1, 2, \ldots, n \qquad \Rightarrow \qquad \sum_{j=1}^{n} x_j c_j = \sum_{j=1}^{n} y_j e_j. \qquad (6\text{-}11)$$


Another preliminary result is derived from the assumption, with no loss of generality, that the first p coefficients $\gamma_j$ are positive and the last (n - p) coefficients $\gamma_j$ are negative, i.e.

$$\gamma_j > 0, \quad j = 1, 2, \ldots, p; \qquad \gamma_j < 0, \quad j = p+1, \ldots, n. \qquad (6\text{-}12)$$

Then after the transformation (6-5), note that we can conveniently express $e_j$, $a_j$ and d in terms of $\gamma_j$, $\alpha_j$ and $\delta$ as follows:

$$d = \delta - \sum_{j=p+1}^{n} \gamma_j$$

$$e_j = \gamma_j, \quad a_j = \alpha_j, \qquad j = 1, \ldots, p \qquad (6\text{-}13)$$

$$e_j = -\gamma_j, \quad a_j = \bar{\alpha}_j, \qquad j = p+1, \ldots, n.$$

Now by using (6-3), it follows that:

$$z_j^{\bar{\alpha}_j} = 1 - z_j^{\alpha_j}. \qquad (6\text{-}14)$$


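By using the above results, the first half of (6-10), i.e.

$$\sum_{j=1}^{n} x_j c_j \ge d \;\Rightarrow\; \sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j \ge \delta,$$

is proven as follows:

$$\sum_{j=1}^{n} y_j e_j \ge d$$

$$\Rightarrow \sum_{j=1}^{p} y_j \gamma_j - \sum_{j=p+1}^{n} y_j \gamma_j \ge \delta - \sum_{j=p+1}^{n} \gamma_j$$

$$\Rightarrow \sum_{j=1}^{p} y_j \gamma_j + \sum_{j=p+1}^{n} (1 - y_j)\gamma_j \ge \delta \qquad (6\text{-}15)$$

and using the fact that $z_j = y_j^{a_j}$ from (6-9) we have the desired result, since $y_j = z_j^{\alpha_j}$ for $j \le p$ and, by (6-14), $1 - y_j = z_j^{\alpha_j}$ for $j > p$.

Next we would like to prove the second part of (6-10), which is the converse of (6-15), i.e.

$$\sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j \ge \delta \;\Rightarrow\; \sum_{j=1}^{n} x_j c_j \ge d.$$

However, this is equivalent to showing that

$$\sum_{j=1}^{n} x_j c_j < d \;\Rightarrow\; \sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j < \delta \qquad (6\text{-}16)$$

and this result can be shown by exactly the same technique used to prove (6-15).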


Note also that the transformation (6-9) from $y$ to $z$ is one-to-one, i.e.

$$(y_1 \ne y_2) \Rightarrow (z_1 \ne z_2). \qquad (6\text{-}17)$$

This is obvious since (6-9) simply complements certain fixed elements of $y$ to form $z$.

Results (6-10) and (6-17) are important because they guarantee that all solutions to the original inequality (6-2) will be found by first transforming the parameters using (6-5) and (6-7) to get the canonical inequality (6-4), solving this inequality for all its solutions, and transforming these solutions back. These transformations are summarized in Fig. 6-1.
FIGURE 6-1

FLOW CHART SHOWING TRANSFORMATIONS INVOLVED IN THE SOLUTION OF A
LINEAR PSEUDO-BOOLEAN INEQUALITY
As an example of inverse transformations of solutions, observe that $x = (0,1,1,0,1)$ is a solution to the canonical inequality used previously in section 6.321 as an example. $x$ transforms to $y = (1,1,0,1,0)$, which transforms to $z = (0,1,1,0,0)$. This last vector satisfies the original inequality, since $-2(0^1) - 3(1^0) + 5(1^0) - 1(0^1) + 2(0^1) = 0$. (Recall from (6-3) that $z^0 = \bar{z}$.)
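
The inverse transformations can be exercised on this example. The Python sketch below is illustrative only: it hard-codes the example parameters as reconstructed from the text (an assumption where the scan is incomplete), and the helper names are hypothetical. It also checks by enumeration that the canonical and general forms admit the same number of solutions.

```python
from itertools import product

gamma, alpha, delta = [-2, -3, 5, -1, 2], [1, 0, 0, 1, 1], 0  # general form
a, P = [0, 1, 0, 0, 1], [3, 2, 1, 5, 4]                       # after (6-5), (6-7)
c, d = [5, 3, 2, 2, 1], 6                                     # canonical form

def inverse_transform(x):
    """Map a canonical vector x to y by (6-8) and then to z by (6-9)."""
    y = [0] * len(x)
    for j, k in enumerate(P):  # y_{P(j)} = x_j, with P stored 1-based
        y[k - 1] = x[j]
    z = [yj if aj == 1 else 1 - yj for yj, aj in zip(y, a)]
    return y, z

def general_lhs(z):
    """Left hand side of (6-2); exponent 0 means the complemented variable."""
    return sum(g * (zj if al == 1 else 1 - zj)
               for g, zj, al in zip(gamma, z, alpha))

def canonical_lhs(x):
    return sum(cj * xj for cj, xj in zip(c, x))

y, z = inverse_transform([0, 1, 1, 0, 1])
print(y, z, general_lhs(z))

canonical_solutions = [x for x in product((0, 1), repeat=5)
                       if canonical_lhs(x) >= d]
mapped = {tuple(inverse_transform(list(x))[1]) for x in canonical_solutions}
general_solutions = {zv for zv in product((0, 1), repeat=5)
                     if general_lhs(zv) >= delta}
print(len(canonical_solutions), mapped == general_solutions)
```

The enumeration is affordable here only because n = 5; it serves as a check, not as a solution method.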

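Another result which will be useful later to relate values of $\sum_j y_j e_j$ to $\sum_j z_j^{\alpha_j}\gamma_j$ before and after transformation (6-9) is given below:

$$\sum_{j=1}^{n} x_j c_j = \sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j - \sum_{(\gamma_j < 0)} \gamma_j. \qquad (6\text{-}18)$$

This is easily proven by using preliminary results (6-11) through (6-14) and (6-9), which gives:

$$\sum_{j=1}^{n} x_j c_j = \sum_{j=1}^{n} y_j e_j = \sum_{j=1}^{p} z_j^{\alpha_j}\gamma_j - \sum_{j=p+1}^{n} (1 - z_j^{\alpha_j})\gamma_j = \sum_{j=1}^{n} z_j^{\alpha_j}\gamma_j - \sum_{j=p+1}^{n} \gamma_j.$$

As a corollary to this we can state that the inverse transformation (6-9) is order-preserving, i.e. for any two vectors $y_k$ and $y_l$,

$$\sum_{j=1}^{n} y_{kj} e_j > \sum_{j=1}^{n} y_{lj} e_j \;\Rightarrow\; \sum_{j=1}^{n} z_{kj}^{\alpha_j}\gamma_j > \sum_{j=1}^{n} z_{lj}^{\alpha_j}\gamma_j. \qquad (6\text{-}19)$$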


6.33 Families of Solutions. 


A set $\mathcal{F}(z, I)$ of solution vectors formed from a given solution vector $z = (z_1, \ldots, z_n)$ and a set of indices $I \subseteq \{1, 2, \ldots, n\}$ is called a family of solutions. All members of the set match the solution vector $z$ at the indices in I and are free to vary at all indices not in I.

For example, $z = (0,0,0,1,0)$ is one solution of the example. Let I = {1,2,3}. The set $\mathcal{F}(z, I)$ of solutions contains four solution vectors (including $z$):

(0,0,0,0,0)
(0,0,0,0,1)
(0,0,0,1,0)
(0,0,0,1,1)

This family can also be noted as $\mathcal{F}$ = (0,0,0,-,-) where (-) indicates either 0 or 1.
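
A family written in the dash notation can be expanded mechanically. In the Python sketch below the free positions are represented by `None` (a representational choice made here, not the author's), and the example inequality of section 6.321 is used, with its coefficients as reconstructed from the text, to confirm that every member is a solution.

```python
from itertools import product

def expand_family(pattern):
    """List every 0/1 vector matching a pattern; None marks a free position."""
    free = [i for i, v in enumerate(pattern) if v is None]
    members = []
    for bits in product((0, 1), repeat=len(free)):
        vec = list(pattern)
        for i, b in zip(free, bits):
            vec[i] = b
        members.append(tuple(vec))
    return members

def general_lhs(z, gamma=(-2, -3, 5, -1, 2), alpha=(1, 0, 0, 1, 1)):
    """Left hand side of the example inequality in general form (delta = 0)."""
    return sum(g * (zj if al == 1 else 1 - zj)
               for g, zj, al in zip(gamma, z, alpha))

family = expand_family((0, 0, 0, None, None))  # the family (0,0,0,-,-)
print(family)
print([general_lhs(z) for z in family])
```

With r = 3 fixed positions out of n = 5, the family holds 2^(n-r) = 4 vectors.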
If $\mathcal{F}$ contains only one vector, namely $z$, it is said to be a degenerate family of solutions. The number of vectors in $\mathcal{F}$ is given by $2^{n-r}$, where r is the number of fixed variables (elements) in I.

A group of solution families is disjoint if each solution vector belongs to one and only one solution family.

Our goal is to find all solution points $z$ to the inequality (6-2) grouped together into families. It can be shown (see section 6.352) that the method used to group solution points into families results in mutually disjoint solution families.

Families of solutions will be found to the canonical inequality (6-4), and these families will be transformed to solutions of (6-2), using the inverse transformations (6-8) and (6-9). A family of solutions is transformable by (6-8) and (6-9) with the obvious convention that in (6-9) if $y_j = (-)$, then $z_j \leftarrow y_j = (-)$ irrespective of whether $a_j = 0$ or $a_j = 1$.

6.34 The Relationship between Binary Trees and Solutions of a LPBI

Certain isomorphisms exist between binary trees and solutions to pseudo-Boolean inequalities. These relationships prove invaluable for developing algorithms to solve inequalities and to visualize the solution process.

6.341 Isomorphism of Tree Paths to Possible Solutions. Each possible solution to a pseudo-Boolean inequality may be pictured as a path through a binary solution tree. This is illustrated in Fig. 6-2A for the inequality

$$3x_1 + 2x_2 + x_3 \ge 4.$$

Starting from the root node r, if we proceed to the left to node a, then $x_1 = 0$. If we go to b from r, then $x_1 = 1$. This takes us to stage 1. To go to stage 2, we can move to c or d from node a, or from node b to either node e or node f. The stage of a node in the solution tree is the number of levels which the node is removed from the root node. There are n+1 stages in the complete solution tree associated with an inequality having n variables.

If we traverse the tree from the root node r to node i in the path r → a → d → i, we have enumerated one of the binary vectors $x = (x_1, x_2, x_3) = (0,1,0)$. A move along a left branch from one stage to the next implies that the variable $x_j$ associated with that stage is to be set at zero. A move to the right implies that the variable is to be set at 1.

By traversing a path from the root to each of the terminal nodes (leaves) of the tree, each binary vector $x$ can be enumerated. Each $x$ could be tested to find only the $x^*$ which are solutions to the inequality. We conclude that each path from the root node to a terminal (leaf) node is isomorphic to a possible solution point $x$. By inspection, nodes l, m and n represent solutions to the inequality.


FIGURE 6-2

SOLUTION TREE AND ASSOCIATED DATA FOR A SIMPLE INEQUALITY

A. Solution Tree
B. Partial Path Records and Partial Inequalities Associated with Tree Nodes

[Tree diagram and table not reproduced. Legible table entries include the partial path records (0,-,-), (1,-,-) and (0,0,-) and the partial inequality $x_3 \ge -1$.]
6.342 Isomorphism of Tree Nodes to Partial Path Records and Partial Inequalities. Associated with each node in the tree is a set of fixed binary variables and a set of arbitrary binary variables.

The fixed set of variables represents a partial path record (PPR) from the root node to any other node in the tree. PPR's become complete path records when the path is traced from the root to the terminal (leaf) nodes. See Fig. 6-2B for an illustration. The set of arbitrary variables are those necessary to specify a complete path record from a PPR. For example, at node d, the fixed variables are $x_1$ and $x_2$, while $x_3$ is arbitrary.

A partial inequality (PIN) can also be associated with each node in the solution tree. The variables in these PIN's are those in the set of arbitrary variables, while the set of fixed variables and their coefficients are absorbed into the right hand side of the PIN.

At any p-th stage node there are p fixed variables and (n-p) arbitrary variables. The PIN associated with a p-th stage node is given by:

$$\sum_{j=p+1}^{n} c_j x_j \ge d - \sum_{j=1}^{p} c_j x_j.$$



As an example, node e of Fig. 6-2A has an associated PIN given
by:

$$x_3 \ge 4 - [3(1) + 2(0)], \quad \text{or} \quad x_3 \ge 1.$$

Fig. 6-2B lists the PPR's and PIN's associated with
all nodes of the solution tree shown in Fig. 6-2A.
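
The absorption of fixed variables into the right hand side can be sketched in a few lines of Python. This is only an illustration: the function name `partial_inequality` is hypothetical, and the inequality $3x_1 + 2x_2 + x_3 \ge 4$ is an assumption chosen to agree with the PIN's quoted in the text for nodes b and e.

```python
def partial_inequality(c, d, ppr):
    """Return (coefficients, rhs) of the PIN at the node reached by `ppr`.

    c   : coefficients c_1..c_n of the full inequality sum(c_j * x_j) >= d
    d   : right hand side of the full inequality
    ppr : fixed 0/1 values x_1..x_p along the path from the root
    """
    p = len(ppr)
    # Fixed variables and their coefficients move to the right hand side.
    rhs = d - sum(cj * xj for cj, xj in zip(c, ppr))
    # The arbitrary variables x_{p+1}..x_n keep their coefficients.
    return list(c[p:]), rhs


# Assumed example inequality: 3*x1 + 2*x2 + x3 >= 4.
c, d = [3, 2, 1], 4
print(partial_inequality(c, d, [1]))      # ([2, 1], 1)  i.e. 2*x2 + x3 >= 1
print(partial_inequality(c, d, [1, 0]))   # ([1], 1)     i.e. x3 >= 1
```

An empty PPR returns the full inequality unchanged, which corresponds to the root node.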


It is possible to construct a binary solution tree for any
pseudo-Boolean inequality, whether it is in general or canonical form.
Fig. 6-3 shows a solution tree for an inequality in a general form.
Fig. 6-4 shows the solution tree for the same inequality after
transformation to canonical form.

The canonical form solution tree has special properties which
enable families of solutions to be built up automatically from special
types of solution tree paths known as basic solution paths (BSP's).
These will be discussed extensively in the following sections.

6.35 Solutions of the Canonical Form

Fig. 6-4 shows the solution tree associated with the canonical
form of the inequality used as an example in section 6.321. For the
canonical inequality all solution values are bounded between 0 and
$\sum_i c_i$. There are no negative values. There are 19 solutions to the
canonical inequality, just as there were to the original inequality.

6.351 Basic Solutions. Of the 19 solution vectors seven have
special properties. These solutions are called basic solutions. They
are formally defined as follows.

A basic solution to the canonical inequality (6-4) is a solution
$x^* = (x_1^*, x_2^*, \ldots, x_n^*)$ such that for each index $i$ with $x_i^* = 1$
the vector $(x_1^*, \ldots, x_{i-1}^*, 0, x_{i+1}^*, \ldots, x_n^*)$ is not a solution of (6-4).






FIGURE 6-4

BINARY SOLUTION TREE ASSOCIATED WITH AN LPBI IN CANONICAL FORM
$5x_1 + 3x_2 + 2x_3 + 2x_4 + x_5 \ge 6$
6.352 Canonical Solution Families. Given a basic solution it is
possible to define a family of solutions¹ in a
manner which exploits the minimal property of the basic solution.

A solution family $\mathcal{F}(x^*, I)$ constructed from a basic
solution $x^*$ using the following rules will be called a canonical
solution family. Let $\ell$ $(1 \le \ell \le n)$ be the last index for which
$x_\ell^* = 1$, where $x^* = (x_1^*, \ldots, x_n^*)$ is a basic solution. $I$ is
then defined to be the set of all indices $i \le \ell$.

¹The basic solution is a minimal solution vector in the
sense that changing any of the variables from 1 to 0 gives a new
vector which is not a solution. It is defined only for the canonical
form of the LPBI, where all coefficients are positive and all
variables are uncomplemented.

In terms of the solution tree, a basic solution corresponds to
a solution path through the tree which does not remain a solution path
if any right branch is changed to a left branch. In Fig. 6-4, the
basic solutions correspond to tree paths numbered 12, 14, 15, 18, 19, 21
and 25. A path through the tree corresponding to a basic solution will
be referred to as a basic solution path (BSP).
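
The counts quoted above can be checked by brute force. The following sketch enumerates all 32 vectors for the inequality of Fig. 6-4 and applies the definition of a basic solution directly. The helper names are hypothetical, and the path numbering (leaves counted from 1, left to right, with a left branch meaning 0) is an assumption that agrees with path 21 being (1,0,1,0,0).

```python
from itertools import product

C = [5, 3, 2, 2, 1]   # coefficients of the canonical inequality of Fig. 6-4
D = 6                 # right hand side


def is_solution(x, c=C, d=D):
    return sum(cj * xj for cj, xj in zip(c, x)) >= d


def is_basic(x, c=C, d=D):
    """A solution is basic if resetting any single 1 to 0 breaks it."""
    if not is_solution(x, c, d):
        return False
    for i, xi in enumerate(x):
        if xi == 1:
            y = x[:i] + (0,) + x[i + 1:]
            if is_solution(y, c, d):
                return False
    return True


def path_number(x):
    """Leaf number counted from 1, left to right, with left branch = 0."""
    return int("".join(map(str, x)), 2) + 1


solutions = [x for x in product((0, 1), repeat=len(C)) if is_solution(x)]
basic = [x for x in solutions if is_basic(x)]
print(len(solutions))                    # 19
print([path_number(x) for x in basic])   # [12, 14, 15, 18, 19, 21, 25]
```

Brute force is exponential in n and serves only as a check on small examples; the tree pruning algorithm developed in the following sections avoids visiting most of the tree.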

Referring to Fig. 6-4, path number 21 through the binary tree
corresponds to solution vector $x_{21}^* = (1,0,1,0,0)$. This solution is
basic and path number 21 is a BSP. It can be made into a canonical
solution family by allowing arbitrary values for the last two 0-valued
vector elements. We can denote this family by

$$\mathcal{F}_{21} = \mathcal{F}(x_{21}^*, I_{21}) = (1,0,1,-,-), \quad \text{where } I_{21} = \{1,2,3\}.$$

Canonical solution family $\mathcal{F}_{21}$ contains $2^{n-\ell} = 2^{5-3} = 4$
solution vectors as members. These are shown as paths numbered 21-24.
The BSP is seen to be the left-most tree path in the family. Some
canonical solution families have only one member (the BSP) and are said to
be degenerate solution families. In Fig. 6-4, paths numbered 12 and 14
are families of this type.
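
The construction of a canonical solution family from a basic solution can be illustrated as follows. The function name `canonical_family` is hypothetical, and the list of basic solutions is the one for the inequality of Fig. 6-4 (paths 12, 14, 15, 18, 19, 21 and 25); this is a sketch, not the program used in this work.

```python
from itertools import product


def canonical_family(basic):
    """All vectors that agree with `basic` up to its last 1; the rest is free."""
    # l is the (1-based) index of the last 1; 0 if the vector is all zeros.
    last = max((i + 1 for i, xi in enumerate(basic) if xi == 1), default=0)
    head = tuple(basic[:last])
    free = len(basic) - last
    return [head + tail for tail in product((0, 1), repeat=free)]


# Basic solutions of 5*x1 + 3*x2 + 2*x3 + 2*x4 + x5 >= 6.
BASIC = [
    (0, 1, 0, 1, 1), (0, 1, 1, 0, 1), (0, 1, 1, 1, 0), (1, 0, 0, 0, 1),
    (1, 0, 0, 1, 0), (1, 0, 1, 0, 0), (1, 1, 0, 0, 0),
]

family_21 = canonical_family((1, 0, 1, 0, 0))
print(family_21)
# [(1, 0, 1, 0, 0), (1, 0, 1, 0, 1), (1, 0, 1, 1, 0), (1, 0, 1, 1, 1)]

members = [x for b in BASIC for x in canonical_family(b)]
print(len(members), len(set(members)))   # 19 19
```

The seven families together contain 19 vectors with no repeats, which agrees with the 19 solutions of the canonical inequality.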

It can be seen that by knowing only the basic solutions,
all other solutions to the canonical inequality can be enumerated.
This is formalised by the following result which has been proven by
Hammer and Rudeanu.

Every solution to the canonical inequality belongs to one and
only one canonical solution family.

Because the inverse transformation of canonical solutions is
one-to-one (see (6-17)), the above result holds after the
transformation. Thus, when the canonical solution families are subjected to
the inverse transformations (6-8) and (6-9), we get mutually disjoint
solution families to the original inequality.

The problem of solving the pseudo-Boolean inequality is now
reduced to the problem of identifying all basic solutions of the
canonical inequality. This will be the subject of the next section.

6.36 Summary of Solution Procedure for the LPBI

Section 6.35 shows that the solutions to the LPBI (6-2) may be
obtained in mutually disjoint families by the following procedure:

(a) transform the original inequality to canonical form;

(b) determine all basic solutions to the canonical form;

(c) construct canonical solution families using each basic solution;

(d) inversely transform the canonical solution families and get
solution families to the original LPBI (6-2).

6.4 Determining Basic Solutions of the LPBI by Searching
the Binary Solution Tree

6.41 Preview of the Tree Pruning Algorithm (TPA)

The method used to determine basic solutions of the canonical
inequality is based on finding all BSP's in the associated binary
solution tree. This method relies upon systematically 'visiting'
nodes of the tree, starting at the root node and moving in a downward
direction toward the terminal (leaf) nodes. When a node is 'visited',
the parameters of the associated PIN are examined. This gives
information about which nodes to visit next.

For each node visited, it may be possible to eliminate further
downward motion in the tree through one of the following two devices:

(a) by determining that no BSP can exist using a branch directed down
to the left, right (or both) of the current node;

(b) by enumerating all complete BSP's which employ branches directed
down to the left, right (or both) of the current node. This makes
further downward movement unnecessary.

When all downward paths through the solution tree have been
blocked by (a) or (b), it follows that all BSP's have been found, and
the node visiting operation stops.

The elimination of downward (away from the root) movements in
the tree through results obtained higher up (closer to the root) in the
tree can be called a 'branch-and-exclude' scheme. The subtree whose
nodes are actually visited is then a small segment of the original
solution tree. This subtree can be considered to arise from the original
tree by a branch-cutting or pruning operation. For this reason the
final algorithm developed is called a tree pruning algorithm (TPA).

At a given node, the decision to prune and/or to enumerate
BSP's is based on a classification scheme to be applied to the
parameters of the PIN associated with the node. The classification scheme
is due to Hammer and Rudeanu and is discussed in section 6.52.

When they are identified, complete BSP's are constructed using
both the PIN and the PPR at any given node. This is discussed in
section 6.51.

Development of the TPA can be broken down logically into two
parts. Definition of what is done when a node is visited is one part.
The other part is concerned with the scheduling of node visits.
Although these two logical parts are linked (node visits can alter the
schedule of remaining visits), it will be convenient to consider the
node visiting portion first.

Section 6.5 provides theory and methods relating to what is
done at an individual node when it is visited. This includes
construction of BSP's and pruning of the solution tree.

The scheduling and record keeping details related to node
visits are deferred to section 6.6.


6.5 Solution Construction and Node Visits


6.51 Constructing Complete BSP's from Partial BSP's


As nodes in the subtree are visited, the PPR is maintained.
Thus suppose at some node currently being visited, a basic solution to
the PIN is identified by the scheme to be presented in section 6.52.
Then the complete BSP consists of two parts and is constructed in the
following manner.

The first part of the complete BSP is the PPR to the current
node. The second part is the basic solution of the PIN associated
with the current node.

These remarks may be formalised by the following results (see
Hammer and Rudeanu(86)).

(A) Let $(x_1^*, x_2^*, \ldots, x_p^*, x_{p+1}^*, \ldots, x_n^*)$ be a basic solution of the
canonical inequality (6-4). Then $(x_{p+1}^*, \ldots, x_n^*)$ is a basic solution of
the inequality

$$\sum_{j=p+1}^{n} c_j x_j \ge d - \sum_{k=1}^{p} c_k x_k^*.$$

(B) If $(x_{p+1}^*, \ldots, x_n^*)$ is a basic solution of the inequality

$$\sum_{j=p+1}^{n} c_j x_j \ge d,$$

then $(0, \ldots, 0, x_{p+1}^*, \ldots, x_n^*)$ is a basic solution of the complete
canonical inequality (6-4).

(C) If $d > 0$ and $(x_2^*, \ldots, x_n^*)$ is a basic solution of

$$\sum_{j=2}^{n} c_j x_j \ge d - c_1,$$

then $(1, x_2^*, x_3^*, \ldots, x_n^*)$ is a basic solution of (6-4).

Result (A) allows partial paths to be excluded from further
consideration when they are "dead-ended" by a PIN which has no
solution. (Use the contrapositive form of statement (A).)

Repeated applications of (B) and (C) allow construction of
complete BSP's from PIN basic solutions and PPR's. One starts with a
basic solution of the PIN and constructs a complete BSP by prefixing
one element of the partial path record at a time to this basic
solution. Results (B) and (C) thus validate the formation of a complete
BSP by simply prefixing the PPR to the basic solution of a PIN at the
node being visited.
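
The prefixing step can be illustrated with the canonical inequality of Fig. 6-4. In the sketch below the node reached by $x_1 = 0$, $x_2 = 1$ is assumed to be one that has not been excluded by the pruning rules; the function names are hypothetical and the three PIN basic solutions are written out by hand rather than derived by the classification scheme.

```python
def is_basic(x, c, d):
    """True if x solves sum(c_j * x_j) >= d and no single 1 can be reset to 0."""
    total = sum(cj * xj for cj, xj in zip(c, x))
    if total < d:
        return False
    return all(total - cj < d for cj, xj in zip(c, x) if xj == 1)


def complete_bsp(ppr, pin_basic):
    """Prefix the partial path record to a basic solution of the PIN."""
    return tuple(ppr) + tuple(pin_basic)


c, d = [5, 3, 2, 2, 1], 6          # canonical inequality of Fig. 6-4
ppr = (0, 1)                       # node reached by x1 = 0, x2 = 1
pin_c = c[len(ppr):]               # PIN coefficients [2, 2, 1]
pin_d = d - sum(cj * xj for cj, xj in zip(c, ppr))   # PIN right hand side 6 - 3 = 3

for pin_basic in [(1, 1, 0), (1, 0, 1), (0, 1, 1)]:
    assert is_basic(pin_basic, pin_c, pin_d)
    bsp = complete_bsp(ppr, pin_basic)
    print(bsp, is_basic(bsp, c, d))   # each completed vector is basic
```

The three completed vectors are the basic solutions on paths 15, 14 and 12 of Fig. 6-4.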
6.52 Node Visits Summarized in Terms of PIN Parameters

By using (A), (B) and (C) of 6.51 above, Hammer and Rudeanu(88,89)
have built up the clever Solution Decision Table shown on Fig.
6-5. This table is important because it permits inferences to be made
about the solutions of a PIN simply by inspection of its coefficients
and right hand side.

The flow chart on Fig. 6-6 presents a modified version of this
decision table which shows the sequence of calculations which are
performed on the parameters of the PIN associated with the current node.
This flow chart is applied when the node is 'visited'. Examining the
parameters leads to a classification of the PIN into one of 7 mutually
exclusive cases. Each of the 7 cases gives information about basic
solutions and exclusion of neighboring nodes in the tree.

Thus at any node of the solution tree, p basic solutions to
the PIN may be identified, where p ≤ n. In addition, one or both of
the branches extending from the current node may be excluded from
further consideration.

Fig. 6-6 defines exactly what is done when a node is visited. 
This completes the discussion of this part of the TPA. Scheduling of 
node visits is next considered. 
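
As an illustration of how the decision table drives the search, the rows of Fig. 6-5 can be applied recursively. The sketch below follows the six rows of the table rather than the seven-case flow chart of Fig. 6-6, uses recursion in place of the explicit scheduling developed in section 6.6, and assumes positive coefficients in non-increasing order; the function name is hypothetical.

```python
def basic_solutions(c, d):
    """Basic solutions of sum(c_j * x_j) >= d with c_1 >= c_2 >= ... >= c_n > 0."""
    n = len(c)
    if d <= 0:                                   # unique basic solution: all zeros
        return [(0,) * n]
    if n == 0:
        return []
    total = sum(c)
    if total < d:                                # no solutions
        return []
    if c[0] >= d:                                # c_1 >= ... >= c_p >= d > c_{p+1}
        p = sum(1 for cj in c if cj >= d)
        units = [tuple(1 if j == k else 0 for j in range(n)) for k in range(p)]
        rest = [(0,) * p + t for t in basic_solutions(c[p:], d)]
        return units + rest
    if total == d:                               # unique basic solution: all ones
        return [(1,) * n]
    # Here every c_i < d and sum(c) > d.
    with_x1 = [(1,) + t for t in basic_solutions(c[1:], d - c[0])]
    if sum(c[1:]) < d:                           # x_1 = 1 is forced
        return with_x1
    without_x1 = [(0,) + t for t in basic_solutions(c[1:], d)]
    return with_x1 + without_x1


print(len(basic_solutions([5, 3, 2, 2, 1], 6)))   # 7
```

Each recursive call corresponds to a move to a PIN lower in the tree; the branches that are never taken are the pruned parts of the tree.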

6.6 Scheduling Node Visits in the Binary Solution Tree 

This section develops methods for the following items: 

(a) scheduling node visits in the binary solution tree; 

(b) maintenance of PPR's corresponding to the node being visited;



FIGURE 6-5

SOLUTION DECISION TABLE¹

Case: $d \le 0$
Conclusions: The unique basic solution is $x_1 = x_2 = \cdots = x_n = 0$.
Validation: Obviously.

Case: $d > 0$ and $c_1 \ge \cdots \ge c_p \ge d > c_{p+1} \ge \cdots \ge c_n$
Conclusions:
(α) For every $k = 1, 2, \ldots, p$: $x_k = 1$, $x_1 = \cdots = x_{k-1} = x_{k+1} = \cdots = x_n = 0$
is a basic solution.
(β) The other basic solutions (if any) are characterized by the
property: $x_1 = \cdots = x_p = 0$, and $(x_{p+1}, \ldots, x_n)$ is a basic
solution of $\sum_{j=p+1}^{n} c_j x_j \ge d$.
Validation: (α) Obviously; (β) by (A) and (B).

Case: $d > 0$, $c_i < d$ $(i = 1, 2, \ldots, n)$ and $\sum_{i=1}^{n} c_i < d$
Conclusions: No solutions.
Validation: Obviously.

Case: $d > 0$, $c_i < d$ $(i = 1, 2, \ldots, n)$ and $\sum_{i=1}^{n} c_i = d$
Conclusions: The unique basic solution is $x_1 = x_2 = \cdots = x_n = 1$.
Validation: Obviously.

Case: $d > 0$, $c_i < d$ $(i = 1, 2, \ldots, n)$, $\sum_{i=1}^{n} c_i > d$ and $\sum_{j=2}^{n} c_j < d$
Conclusions: The basic solutions (if any) are characterized by the
property: $x_1 = 1$, and $(x_2, \ldots, x_n)$ is a basic solution of
$\sum_{j=2}^{n} c_j x_j \ge d - c_1$.
Validation: By (A) and (C).

Case: $d > 0$, $c_i < d$ $(i = 1, 2, \ldots, n)$, $\sum_{i=1}^{n} c_i > d$ and $\sum_{j=2}^{n} c_j \ge d$
Conclusions: The basic solutions (if any) are characterized by the
property: either $x_1 = 1$ and $(x_2, \ldots, x_n)$ is a basic solution of
$\sum_{j=2}^{n} c_j x_j \ge d - c_1$, or $x_1 = 0$ and $(x_2, \ldots, x_n)$ is a
basic solution of $\sum_{j=2}^{n} c_j x_j \ge d$.
Validation: By (A), (B), and (C).

¹From: Peter L. Hammer and Sergiu Rudeanu, Pseudo-Boolean Methods for
Bivalent Programming, Lecture Notes in Mathematics, Vol. 23 (Berlin,
Heidelberg, New York: Springer-Verlag, 1966), page 27.
(c) maintenance of a PIN coefficients list corresponding to the node
being visited.

Item (a) above is developed by first considering a simple
algorithm for scheduling pre-order binary tree traversal. (Tree
traversal is the process of visiting all nodes in some specified order.)
This simple algorithm is presented in section 6.62. It does not allow
the outcome of node visits to alter the schedule of other node visits.
The entire tree must be defined prior to traversal in this simple
algorithm.

Section 6.63 discusses modifications to the tree traversal
algorithm (TTA) to permit tree pruning. Tree pruning is the process
whereby the tree traversal schedule is modified by results obtained
when tree nodes are visited.

Finally, section 6.64 gives details on how the dynamic PPR and
PIN records are maintained during the traversal.

Section 6.61 precedes all the above with a simple example of
how the TPA should work, to illustrate the problem of dynamic scheduling
of node visits.

6.61 A Simple TPA Example Problem

Consider the tree shown in Fig. 6-2A. One method of starting
at the root node and sequentially visiting each node in the tree only
once is called pre-order tree traversal(91).

The pre-order traversal sequence applied to the tree gives the
following order for node enumeration: r→a→c→g→h→d→i→j→b→e→k→l→f→m→n.
At each node in the order given above, the PIN is classified using Fig.
6-6.

For node r, we have case 6 of Fig. 6-6 since $d > 0$; $c_1 < d$;
$\sum_{i=1}^{n} c_i > d$; and $\sum_{i=2}^{n} c_i < d$. The basic solutions, if any, are found by
setting $x_1 = 1$ and advancing one stage down to the right to node b.
We have bypassed the entire left branch of the tree (where $x_1 = 0$).
Thus we have eliminated nodes (a,c,d,g,h,i,j) from further
consideration. This is an illustration of the pruning operation.

The revised schedule for pre-order traversal of the remainder
of the tree is b→e→k→l→f→m→n. At node b we consider the PIN:
$2x_2 + x_3 \ge 1$. This inequality matches case 4 of Fig. 6-6, since both
coefficients are at least as large as the right hand side, so that
p = 2 = n. Thus the basic solutions of the PIN are given by (1,0)
and (0,1). Since $(x_1,x_2,x_3) = (1,-,-)$ is the PPR at node b, the BSP's
to the original inequality are given by (1,1,0) and (1,0,1). This
concludes the traversal process since all other nodes have been excluded,
and the algorithm terminates after node b has been visited.

Thus by analyzing PIN's at two nodes of the 15-node tree, all
the BSP's have been found. The ideas presented in this example
represent the basic procedure used to identify all the BSP's in a
solution tree.
6.62 The Pre-Order Tree Traversal Algorithm (TTA)

The general LPBI solution procedure has been illustrated in
the preceding section. An important characteristic of this procedure
is the successive re-definition of the traversal schedule which
occurs as a result of node visits. A separate sub-algorithm to handle
dynamic changes in the traversal schedule is needed.

The algorithm for dynamic scheduling used in the final TPA
has been derived from a simpler algorithm called the pre-order TTA.
The TTA allows no dynamic modification of the tree structure and
requires that the entire tree be defined before node visiting begins.
To promote understanding of the final TPA, the simpler TTA is
presented here in detail.

There are three principal ways to traverse a binary tree,
visiting each node once and only once. These methods all give rise
to a specific ranking of the tree nodes in the order in which they
will be visited. They are termed pre-order, post-order and
end-order traversal. Pre-order traversal will be used here. It is
defined by the following successive steps:

(a) visit the root;

(b) traverse the left subtree;

(c) traverse the right subtree.

In the example stated previously in section 6.61, the tree of
Fig. 6-2A has a pre-order traversal schedule
given by: (r,a,c,g,h,d,i,j,b,e,k,l,f,m,n).
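
The three-step definition above translates directly into a recursive routine. In this sketch the layout of the tree of Fig. 6-2A (which node is the left or right child of which) is an assumption inferred from the traversal schedule and from the PPR's quoted in the text; the names `TREE` and `preorder` are hypothetical.

```python
# Assumed layout of the tree of Fig. 6-2A as {node: (left child, right child)}.
# Nodes that do not appear as keys are leaves.
TREE = {
    "r": ("a", "b"),
    "a": ("c", "d"), "b": ("e", "f"),
    "c": ("g", "h"), "d": ("i", "j"), "e": ("k", "l"), "f": ("m", "n"),
}


def preorder(node, tree=TREE):
    """Visit the root, then the left subtree, then the right subtree."""
    if node is None:
        return []
    left, right = tree.get(node, (None, None))
    return [node] + preorder(left, tree) + preorder(right, tree)


print(",".join(preorder("r")))   # r,a,c,g,h,d,i,j,b,e,k,l,f,m,n
```

The recursion hides the bookkeeping of which nodes remain to be visited; the push-down list described next makes that bookkeeping explicit.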
Before describing the method used to guarantee pre-order
traversal, it is convenient to discuss three data structures required,
namely a link table, a push-down list and a single working storage
location. The link table is necessary to show how the tree nodes are
linked to each other. For the tree of Fig. 6-2A we can show a link
diagram and corresponding link table (see Fig. 6-7). The tree
structure is completely defined by the link table. Each tree node has a
left and a right link to other nodes. Tree nodes are given an integer
tag for internal machine use, but this tag can be related to other
symbols via a look-up table. The null link is represented here by -1.
The data structure STACK is a push-down, pop-up list with last-in,
first-out (LIFO) discipline. STACK functions as a 'memory' for nodes
remaining to be visited. A single storage location labeled P is also
required to define the node currently being visited.

The following conventions will be used to describe data storage
and data movement instructions. We read P←LLINK(P) as "replace the
contents of memory location P with the contents of the memory location
LLINK(P)". Memory location LLINK(P) is not modified by the preceding
operation. For push-down list operations, we read P←STACK as "replace
contents of memory location P with the contents of whichever memory
location is at the top of the push-down list STACK". After this
operation, the list STACK is to be popped up, or shortened by one item. The
data transferred to P is no longer stored in STACK after the list is
popped up. The list operation STACK←P means that "the contents of
memory location P are to become the first item in the list STACK, on top
of elements already in STACK." This pushes down the list by adding one
more element. The contents of memory location P are not modified.

FIGURE 6-7

BINARY TREE WITH ASSOCIATED LINK DIAGRAM AND LINK TABLE

(A) Tree Diagram

(B) Link Diagram

(C) Link Table (columns: Node P, LLINK(P), RLINK(P))

The TTA is described by the flow chart of Fig. 6-8. (This
description is similar to that for a post-order TTA given by Knuth.)
Operations shown in this flow chart are numbered. Written descriptions
of these operations are given below. These descriptions are numbered
to correspond to the numbers of Fig. 6-8.

(1) P←ROOT. The number of the current node is replaced by the
number of the root node. This is an initialization step. STACK is
assumed empty.

(4) VISIT P. Some operation is performed at node P (such as
investigating the parameters of an inequality).

(5) STACK←P. The node number in P is put on the push-down
list STACK. (Note that the contents of P are not modified.)

(2) P←LLINK(P). The node number in P is replaced by the node
number in LLINK(P) which is defined in the link table. This prepares
for a move down the tree and to the left.

(3) P = -1? Test to see if the contents of P are the null
link. If yes, go to step (6) to determine whether STACK is empty. If
no, go to step (4).

(6) STACK EMPTY? If the push-down list STACK is empty, the
algorithm is terminated. If STACK is not empty, go to step (7).

(7) P←STACK. Replace the contents of P with the node number at
the top of the push-down list STACK. This pops up the list. The tree
move is upward and to the right, back to the pivot node.

(8) P←RLINK(P). Replace the contents of P with the right link
number of the current node in P. The tree move is downward and to the
right, away from the pivot node.

FIGURE 6-8

FLOWCHART SHOWING AN ALGORITHM FOR PRE-ORDER TRAVERSAL OF BINARY TREES

Note: The tree is assumed defined by a complete link table, with -1 as null link.

Note: Algorithm steps are numbered to correspond to the descriptions given in the text.

Operation of the TTA can be further illustrated with an example
using the tree of Fig. 6-2A. Fig. 6-9 shows a "snapshot" of the
contents of the various memory locations after each step of the algorithm
(as shown on Fig. 6-8) is completed. Thirty-nine sequential steps are
shown, which caused nodes r,a,c,g,h,d to be visited in that order.
Traversal of the rest of the tree in pre-order can be continued in the
same way until node n has been explored, at which time the algorithm
terminates.

The pre-order TTA consists of 2 types of operations:

(a) moving downward and to the left in the tree, one node at a time,
while retaining a record of downward moves (node numbers) in the
push-down list; and

(b) moving back up to the right, one node at a time, by popping up
the push-down list, then moving down to the right, one stage. This is a
'back up and go round the corner' type of move.

Note that the push-down list STACK never contains more than
(n+1) elements, where n is the number of levels (stages) in the tree.
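
The numbered steps of the TTA can be sketched with an explicit link table and push-down list. The integer tags assigned to the nodes (1 to 15 in pre-order) and the tree layout are assumptions made for this illustration, since the link table of Fig. 6-7 is not reproduced here; the step numbers in the comments refer to the descriptions above.

```python
NULL = -1

# Assumed link table for the 15-node tree of Fig. 6-2A.
NAMES = "racghdijbeklfmn"            # tag k (1-based) corresponds to NAMES[k-1]
LLINK = {1: 2, 2: 3, 3: 4, 4: NULL, 5: NULL, 6: 7, 7: NULL, 8: NULL,
         9: 10, 10: 11, 11: NULL, 12: NULL, 13: 14, 14: NULL, 15: NULL}
RLINK = {1: 9, 2: 6, 3: 5, 4: NULL, 5: NULL, 6: 8, 7: NULL, 8: NULL,
         9: 13, 10: 12, 11: NULL, 12: NULL, 13: 15, 14: NULL, 15: NULL}


def preorder_tta(root, llink, rlink):
    """Pre-order traversal using a link table and a push-down list."""
    order, stack, max_depth = [], [], 0
    p = root                          # (1) P <- ROOT
    while True:
        if p != NULL:                 # (3) P = -1 ?  (answer: no)
            order.append(p)           # (4) VISIT P
            stack.append(p)           # (5) STACK <- P
            max_depth = max(max_depth, len(stack))
            p = llink[p]              # (2) P <- LLINK(P)
        elif not stack:               # (6) STACK EMPTY?  (answer: yes, stop)
            return order, max_depth
        else:
            p = stack.pop()           # (7) P <- STACK
            p = rlink[p]              # (8) P <- RLINK(P)


order, depth = preorder_tta(1, LLINK, RLINK)
print("".join(NAMES[k - 1] for k in order))   # racghdijbeklfmn
print(depth)                                  # 4
```

For this tree n = 3, and the deepest the push-down list becomes is 4 = n+1, in agreement with the bound stated above.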

6.63 Modifying the TTA to Permit Tree Pruning

The pre-order TTA presented in section 6.62 provides the basic
framework for the TPA. However, there are two modifications of the TTA
which must be made. These are discussed below.
FIGURE 6-9

EXAMPLE PROBLEM ILLUSTRATING THE TREE PRE-ORDER TRAVERSAL ALGORITHM
6.631 Elimination of the Pre-Determined Link Table. The traversal
algorithm requires that the link table be defined before traversal. In
the search for BSP's, many of the tree nodes will never be visited,
since they will have been excluded (pruned away) from further
consideration by results obtained at nodes nearer the root of the tree. Link
table information for nodes to be excluded is not needed. To avoid
defining the tree completely ahead of time, the tree is constructed by
the algorithm itself, and the only nodes which are defined in the link
table are those which must be visited, i.e. those which have not been
pruned away by previous results obtained higher up in the tree. Thus
the structure of the tree is actually determined as it is traversed.

Necessary modifications to the algorithm shown on Fig. 6-8
involve only the insertion of a new operation between the blocks labeled
(4) and (5) as shown below:

(4) VISIT P → DEFINE LLINK(P), RLINK(P) → (5) STACK←P

This new operation is the definition of left and right links of node P.
It can be considered as part of block (4) (VISIT P) if desired.

6.632 Storage Allocation Modifications. Defining a link table as the
tree is traversed introduces practical considerations. How can
identification numbers be assigned to new nodes? And how much storage space
is required for a link table used with a tree of given size?

One obvious method for assigning node numbers is to define a
new sequential integer for each new node that is discovered. The size
of the link table is then proportional to the size of the set of
visited nodes, i.e.

$$1 + 2 + 4 + \cdots + 2^n = \sum_{k=0}^{n} 2^k = 2^{n+1} - 1$$

for a tree with all nodes visited. This is the maximum size of the
link table and several values are shown below.

     n      2^(n+1) - 1
     5           63
    10        2,047
    15       65,535



Clearly this method is unworkable, since maximum storage space
requirements are much too great.

The method used in the TPA is to use node numbers over
again. The node number (index in the link table) is assigned to a
new node once the node it originally was assigned to becomes inactive.
From the description of the pre-order TTA, it can be seen that once a
tree node is removed from the push-down list STACK (a 'back up and
around the corner' move), this node is not utilized for any further
processing and will be defined as being inactive. (Active nodes are
defined as those nodes which are in the push-down list STACK, or those
nodes which are right links of nodes in STACK, since right link nodes
may become occupants of STACK.) Once nodes become inactive, their node
numbers become eligible for re-assignment to new nodes.
Since the maximum number of nodes in STACK at one time is (n+1),
and since each node has only one possible right link, the maximum
number of active nodes will be 2(n+1). Thus, the dynamic link table will
contain at most 2(n+1) node link records. Also only 2(n+1) unique
node numbers will ever be needed at one time.

In order to assign node numbers as needed during traversal of
the tree, a second push-down list PLIST is initially loaded with 2(n+1)
consecutive integers so that the first integer removed is 1. As the
tree is traversed, new nodes may be identified. These new nodes will
be assigned numbers taken from the top of PLIST, which pops up the list.

Numbers from inactive nodes are placed on the top of PLIST,
which pushes down the list. This occurs as soon as the nodes become
inactive, or between steps (7) and (8) of Fig. 6-8 (between the 'move
back up' and the 'move down-right').
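
The re-use of node numbers through PLIST can be sketched for the worst case in which no pruning occurs and every node of a complete tree with n stages is visited. The function name and the bookkeeping variables are hypothetical; the link table is built during the traversal, as described in section 6.631, and only 2(n+1) node numbers are made available.

```python
def traverse_with_reuse(n):
    """Pre-order visit of a complete binary tree with n stages, re-using node numbers."""
    NULL = -1
    size = 2 * (n + 1)
    plist = list(range(size, 0, -1))     # pop() from the end returns 1 first
    llink, rlink, stage = {}, {}, {}
    stack, visited, max_in_use = [], 0, 0

    p = plist.pop()                      # the root receives node number 1
    stage[p] = 0
    while True:
        if p != NULL:
            visited += 1                 # VISIT P, then define its links
            if stage[p] < n:
                llink[p], rlink[p] = plist.pop(), plist.pop()
                stage[llink[p]] = stage[rlink[p]] = stage[p] + 1
            else:                        # leaf node: both links are null
                llink[p] = rlink[p] = NULL
            max_in_use = max(max_in_use, size - len(plist))
            stack.append(p)
            p = llink[p]
        elif not stack:
            return visited, max_in_use
        else:
            q = stack.pop()              # move back up: q becomes inactive
            p = rlink[q]                 # move down to the right
            plist.append(q)              # q's number is available again


for n in (3, 5, 10):
    visited, in_use = traverse_with_reuse(n)
    assert visited == 2 ** (n + 1) - 1
    assert in_use <= 2 * (n + 1)
    print(n, visited, in_use)
```

If PLIST were ever exhausted the `pop()` call would fail, so a run that completes confirms that 2(n+1) numbers suffice for the tree sizes tried.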

The TPA with the modifications necessary to provide for the
dynamic link table is shown in Fig. 6-10.


6.64 Maintaining the Dynamic PPR and PIN Records


6.641 Maintenance of the PPR. As was discussed in sections 6.41 and
6.51, a complete BSP of the canonical inequality is constructed from
two components. The PPR from the root to node P is required together
with a basic solution of the PIN associated with node P.

In addition, the PPR is required to form the right hand side
of the PIN from the right hand side of the complete canonical
inequality.
FIGURE 6-10

FLOWCHART OF THE TREE TRAVERSAL ALGORITHM AFTER MODIFICATION TO PERMIT
GENERATION OF A DYNAMIC LINK TABLE

The above two uses require that the PPR be recorded and updated
as the various tree nodes are sequentially visited. This represents an
addition to the pre-order TTA.

The PPR is an ordered list of 0's and 1's (left and right
branches) along the path from the root node to the current node P.

A running record of the partial path is kept in a binary vector
Y(J) having n elements. The index of the last element of Y(J)
which is recorded represents the 'level' in the tree where the current
node P is located. Recall that the 'level' associated with any node
P ranges from 0 (the root node) to n (the leaf nodes). This level
is called STAGE(P) in the trees of Figs. 6-2, 6-3, and 6-4. The
variable STAGE(P) is assigned as an attribute to each new node P in
the dynamic link table at the same time LLINK(P) and RLINK(P) are
defined. STAGE(P) is retained as part of the node record in the dynamic
link table.

As the partial path grows downward and to the left, 0's are
added to the list Y(J). As the partial path is retraced back up the
tree and downward to the right, the list Y(J) is first shortened and
then expanded with 1's reflecting the rightward move.

To permit the list Y(J) to be modified as the tree is
traversed, two pointers PT1 and PT2 (called the following and lead pointers
respectively) which refer to elements in Y(J) are used.

As movement proceeds downward and to the left in the tree, the
pointers and the PPR are revised according to the following rules:

    PT1 ← PT2
    PT2 ← STAGE(P)                                  (6-20)
    Y(J) ← 0;  J = PT1+1, ..., PT2.

These operations expand the list Y(J) by adding zeros.

Fig. 6-11 illustrates how the PPR is dynamically modified.
Suppose the partial path and PPR shown in Fig. 6-11A exist at some
time during enumeration of the tree. This is to be regarded as initial
data. Next, suppose a move is made extending the initial partial path
down two stages to the left. Revised data is given in Fig. 6-11B,
after using (6-20).

As movement proceeds back up the tree and then downward and to
the right, the pointers and the PPR are revised according to the
following rules:

    P ← STACK
    PT1 ← STAGE(P)
    P ← RLINK(P)                                    (6-21)
    PT2 ← STAGE(P)
    Y(J) ← 1;  J = PT1+1, ..., PT2.

For example, starting with the data shown in Fig. 6-11B, assume
a move is made back up the tree and down to the right. The final
results are shown in Fig. 6-11C. The PPR Y(J) = 0 is erased as
movement proceeds back up the tree, and overwritten with Y(J) = 1 as
movement proceeds down and to the right.


















6.642 Maintenance of PIN Coefficients. Provisions are also made for
dynamically updating coefficients and right hand side of the PIN as the
tree is traversed. This is another addition to the TTA.

A list of coefficients C'(J), J=1,2,...,M is maintained by
using the list of coefficients C(J), J=1,2,...,N for the original
inequality. The C'(J) are copied from the list of C(J) as follows:

    M ← N - PT2
    K ← PT2 + L                                     (6-22)
    C'(L) ← C(K);  L = 1,2,...,M.

The right hand side D' of the current PIN is then given in
terms of the original right hand side D as:

$$D' \leftarrow D - \sum_{J=1}^{PT2} [C(J) \cdot Y(J)]. \tag{6-23}$$

An example of PIN parameter revision using (6-20) and (6-22)
is shown in Fig. 6-11B. Fig. 6-11C uses (6-21), (6-22) and (6-23).
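
Rules (6-20) through (6-23) can be collected into a small record-keeping sketch. The class name `PathRecord` and its methods are hypothetical, the stack and link table operations of (6-21) are left to the caller (who supplies the two stage numbers), and the example inequality $3x_1 + 2x_2 + x_3 \ge 4$ is an assumption consistent with the PIN's quoted in section 6.61.

```python
class PathRecord:
    """Working vector Y with following and lead pointers PT1 and PT2."""

    def __init__(self, c, d):
        self.c, self.d = list(c), d      # C(J) and D of the canonical inequality
        self.y = [None] * len(c)         # Y(J); None marks an unrecorded element
        self.pt1 = self.pt2 = 0

    def move_left(self, stage_p):
        """Rules (6-20): extend the path down-left to a node at stage_p."""
        self.pt1, self.pt2 = self.pt2, stage_p
        self._fill(0)

    def move_up_right(self, stage_popped, stage_right):
        """Rules (6-21): back up to the popped node, then down-right."""
        self.pt1, self.pt2 = stage_popped, stage_right
        self._fill(1)

    def _fill(self, value):
        for j in range(self.pt1, self.pt2):      # J = PT1+1..PT2 in 1-based terms
            self.y[j] = value
        for j in range(self.pt2, len(self.y)):   # elements past PT2 are erased
            self.y[j] = None

    def pin(self):
        """Rules (6-22) and (6-23): coefficients C'(L) and right hand side D'."""
        coeffs = self.c[self.pt2:]
        rhs = self.d - sum(self.c[j] * self.y[j] for j in range(self.pt2))
        return coeffs, rhs


rec = PathRecord([3, 2, 1], 4)
rec.move_left(1)            # root -> node a, PPR (0,-,-)
print(rec.pin())            # ([2, 1], 4)
rec.move_up_right(0, 1)     # back up to the root, then down-right to node b
print(rec.pin())            # ([2, 1], 1)
rec.move_left(2)            # node b -> node e, PPR (1,0,-)
print(rec.pin())            # ([1], 1)
```

Because the stage numbers are passed explicitly, a move may span more than one stage, as in the example of Fig. 6-11B.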

This completes the discussion of modifications and additions
necessary to convert the pre-order TTA to the TPA. A flow chart of
the TPA is shown in Fig. 6-12. A detailed description of this flow
chart is presented in the next section.
FIGURE 6-12

FLOW CHART OF THE TREE PRUNING ALGORITHM FOR SOLVING PSEUDO-BOOLEAN
INEQUALITIES IN CANONICAL FORM

Note: Algorithm steps are numbered to correspond to the descriptions given in the text.




















6.7 The Tree-Pruning Algorithm (TPA) 
6.71 Detailed TPA Description 


Xk2 


The TPA is described by the flow chart of Fig. 6~12. The var- 
ious operations shown in this flow chart are numbered and the written 
descriptions given below refer to these nimbered operations. 

(1) Initialize. The push-down list of new node numbers PLIST
is loaded with sequential integers 1,2,...,2(1+1'). The first integer
(unity) is removed from PLIST and placed in P to correspond to the
root node. STAGE(P) <- 0 for the root node.

The PPR is undefined at the root node. The pointers PT1 and
PT2 are both set to zero.

The PIN parameters C'(J) and D' are set equal to the canonical
inequality parameters C(J) and D.

(2) Visit P. The PIN associated with node P is classified
using Fig. 6-6. X1 is the classification case number and X2 is the
number of PIN basic solutions identified. If nodes linked to P are
identified, they are assigned numbers from the push-down list PLIST.
These node numbers are entered in the dynamic link table as LLINK(P),
RLINK(P) or both. They appear as part of the node record. Also
each node linked to P has its attribute STAGE(LLINK(P)), STAGE(RLINK(P)),
or both recorded in the dynamic link table at this time.

(3) Test for basic solutions. If X2=0, then no PIN basic
solutions were identified when P was visited.

(4) Construct and record all complete BSP's discovered. All
PIN basic solutions (X2 of them) identified in step (2) are used here.
Each complete BSP is constructed using the PPR and a PIN basic solution.
The form of each PIN basic solution is determined indirectly by X1
from step (2).

(5) STACK <- P. The number of the node just visited is placed
on the top of the push-down list STACK.

(6) P <- LLINK(P). The number of the node just visited in step
(2) is replaced by the number of its left link node. The movement is
downward and to the left in the tree.

(7) Test for null link. Here the test P = -1 is performed
to determine whether the node visited in step (2) has a left link to a
new node. If no link node exists downward to the left, then control
transfers to step (8) for a move back up the tree to the node visited
in step (2). This is followed by a move down and to the right. If a
left link does exist to a node further down the tree, then control
transfers to step (19) for updating the PPR and the PIN parameters.

(8) P <- STACK. The number of the node visited last is removed
from the push-down list. This node is the pivot node for a move around
the corner and down to the right.

(9) PT1 <- STAGE(P). The following pointer in the PPR is moved
back to the stage of the pivot node (sometimes this step results in no
actual movement of the pointer).


(10) PLIST <- P. Since the pivot node will not be needed again,
it becomes inactive and its node number is released for future use by
new nodes. This is done by placing the node number back on the push-down
list PLIST.

(11) P <- RLINK(P). The number of the pivot node P is replaced
by the number of its right link for a move down the tree and to the
right.

(12) Test for null link. Here we test whether the right link
of the pivot is non-null. If it is null, then go to step (13) to
test for an empty STACK. If it is not null, then go to step (15) to
prepare to visit the node.

(13) Test for empty STACK. If the push-down list STACK is
empty, then the algorithm is terminated at step (14) and the tree has
been completely traversed.

(14) STOP. The tree has been traversed.

(15) PT2 <- STAGE(P). The lead PPR pointer is moved ahead to
correspond to the move down the tree (STAGE(P) was established in
step (2)).

(16) Y(J) <- 1; J=PT1+1,...,PT2. The PPR is expanded to reflect
the move down the tree and to the right.

(17) M <- N - PT2; C'(L) <- C(L+PT2), L=1,...,M. The PIN
coefficients are updated to correspond to the node P which will be visited.

(18) $D' \leftarrow D - \sum_{J=1}^{PT2} Y(J)\,C(J)$. The right hand side is
adjusted to correspond to the PIN associated with the node P which will be
visited.

(19) PT1 <- PT2; PT2 <- STAGE(P). Advance both leading and
following PPR pointers to correspond to a move down the tree and to the
left.

(20) Y(J) <- 0; J=PT1+1,...,PT2. The PPR is expanded to reflect
the move down the tree and to the left.
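
The stack handling in steps (2) and (5) through (14) amounts to a pre-order traversal with an explicit push-down list. The sketch below shows only that control flow on a small tree that is already fully linked; in the TPA itself the links are created while a node is visited. It rests on assumptions not stated in the text: a null link is encoded as -1, control returns to step (8) when the STACK tested in step (13) is not empty, and the name `preorder_visits` and the dictionary link tables are hypothetical.

```python
NULL = -1  # assumed encoding of a null link


def preorder_visits(llink, rlink, root):
    """Traverse a linked binary tree in pre-order using an explicit stack."""
    visited = []     # order in which step (2) visits nodes
    released = []    # node numbers handed back to PLIST in step (10)
    stack = []       # the push-down list STACK
    p = root
    while True:
        visited.append(p)               # step (2): visit P
        stack.append(p)                 # step (5): STACK <- P
        p = llink[p]                    # step (6): P <- LLINK(P)
        while p == NULL:                # steps (7) and (12): null-link tests
            if not stack:               # step (13): is STACK empty?
                return visited, released    # step (14): STOP
            pivot = stack.pop()         # step (8): P <- STACK
            released.append(pivot)      # step (10): PLIST <- P
            p = rlink[pivot]            # step (11): P <- RLINK(P)


# Hypothetical four-node tree: 1 has children 2 and 3; 2 has only a right child 4.
llink = {1: 2, 2: NULL, 3: NULL, 4: NULL}
rlink = {1: 3, 2: 4, 3: NULL, 4: NULL}
visited, released = preorder_visits(llink, rlink, 1)
print(visited)    # [1, 2, 4, 3]
print(released)   # [2, 4, 1, 3]
```

The pointer and parameter updates of steps (9) and (15) through (20) are omitted here to keep the traversal logic visible.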

6.72 Example Problems

Two example problems are given here. First, the very simple
example used in section 6.61 and shown in Fig. 6-2 is presented here
in detail. This example shows step-by-step operation of the TPA. It
is discussed in 6.721 below.

The second example (in section 6.722 below) illustrates the
entire LPBI solution process. This includes:

(a) illustration of the parameter transformation to canonical form;

(b) an overview of basic solution determination using the TPA;

(c) generation of canonical solution families from basic solutions;

(d) transformation of canonical solution families to general solution
families.

The LPBI used in the second example is the same one discussed
in section 6.352 and illustrated in Figures 6-3 and 6-4.

6.721 A Detailed Example of the TPA. Fig. 6-2A shows the complete
solution tree for the inequality

$$3x_1 + 2x_2 + x_3 \ge 3.$$

By applying the TPA of Fig. 6-12 to this tree, we can solve the
inequality. The general method of doing this was illustrated in section
6.61. Fig. 6-13 shows the detailed results as the TPA of Fig. 6-12 is
applied. Each step of Fig. 6-13 corresponds to a numbered block in the
flow chart of Fig. 6-12. The status of all data structures except the
link table is shown in Fig. 6-13. The status of the dynamic link table
is illustrated in Fig. 6-14 as it is modified during the tree traversal.
Only two records appear in this link table because only two tree nodes
are visited before all solutions are found.

6.722 Solving the General Form Inequality. This example follows the
solution of the inequality

$$-2z_1 - 3z_3 + 5z_2 - z_4 + 2z_5 \ge 0.$$

The solution tree to this general form inequality is illustrated in
Fig. 6-3. The transformation of the inequality parameters to canonical
form is shown in Fig. 6-15A. The same transformations were used
as an example in section 6.321. They are presented again in Fig. 6-15A
with other transformations required for the complete solution of this
inequality. The solution tree associated with this canonical inequality
is shown on Fig. 6-4.

The application of the TPA of Fig. 6-12 is illustrated below to
find the seven basic solutions indicated in Fig. 6-15B and in Fig. 6-4.
Node visits are presented sequentially and detailed results are shown.
Each paragraph below corresponds to a single node visit. The growth of
the pruned subtree resulting from the node visit is also shown
graphically.

FIGURE 6-13

EXAMPLE PROBLEM SHOWING DETAILS OF THE TREE PRUNING ALGORITHM
FOR THE INEQUALITY 5X1 + 2X2 + X3 > 4

Columns include P, STACK and X1. Step refers to algorithm step shown on
flow chart and described in the text.

FIGURE 6-14

CONTINUATION OF EXAMPLE PROBLEM SHOWING DETAILS OF TREE PRUNING ALGORITHM

Panels: Original Data (Canonical Form); Exploration Path.

(a) Visit node 1 (root). The node records give PIN =
5x_1 + 3x_2 + 2x_3 + 2x_4 + x_5 >= 6 and PPR = Y = (-,-,-,-,-). Node
classification parameters are given by: d > 0; s_1 = 13 > d; p = 0; and
s_2 = 8 > d. Node classification is case 7. There are no basic solutions
and no exclusions. Advance one stage down both right and left branches.
Define two new nodes. Label them 2 and 3. The tree is now defined as:

[tree diagram]

(b) Visit node 2 at stage 1. The node records give: PIN =
3x_2 + 2x_3 + 2x_4 + x_5 >= 6 and PPR = Y = (0,-,-,-,-). Node
classification parameters are given by: d > 0; s_1 = 8 > d; p = 0;
s_2 = 5 < d; and n = 4 > 1. Node classification is case 6. There are no
basic solutions. Exclude the left branch, and advance one stage down the
right branch. Define a new node. Label it 2, since the pivot node 2
has become inactive, and its number may be used over again. The tree
is now defined as:

[tree diagram]



(c) Visit node 2 at stage 2. The node records give: PIN =
2x_3 + 2x_4 + x_5 >= 3 and PPR = Y = (0,1,-,-,-). Node classification
parameters are given by: d > 0; s_1 = 5 > d; p = 0 and s_2 = 3 >= d.
Node classification is case 7. There are no basic solutions. Exclude
neither the right nor left branches. Advance one stage down both the
left and right branches. Define two new nodes. Label them 4 and 5.
The tree is now defined as:

[tree diagram]

(d) Visit node 4 at stage 3. The node records give: PIN =
2x_4 + x_5 >= 3 and PPR = Y = (0,1,0,-,-). Node classification
parameters are given by d > 0 and s_1 = 3 = d. Node classification is
case 2. There is a unique basic solution. Exclude both left and right
branches. Define no new nodes. The PIN basic solution is (x_4,x_5) =
(1,1) and the BSP is x = (0,1,0,1,1). The tree is now defined as:

[tree diagram]


(e) Visit node 5 at stage 3. The node records give: PIN =
2x_4 + x_5 >= 1 and PPR = Y = (0,1,1,-,-). Node classification
parameters are given by: d > 0; s_1 = 3 > d; and p = 2 = n. Node
classification is case 4. There are two basic solutions. Exclude both left
and right branches. Define no new nodes. The PIN basic solutions are:
(x_4,x_5) = (1,0) and (0,1). The BSP's are: x = (0,1,1,1,0) and
(0,1,1,0,1). The tree is now defined as:

[tree diagram]



(f) Visit node 3 at stage 1. The node records give: PIN =
3x_2 + 2x_3 + 2x_4 + x_5 >= 1 and PPR = Y = (1,-,-,-,-). Node
classification parameters are given by: d > 0; s_1 = 8 > d; and p = 4 = n.
Node classification is case 4. There are four basic solutions. Exclude
both right and left branches. Define no new nodes. The PIN basic
solutions are (x_2,x_3,x_4,x_5) = (1,0,0,0) and (0,1,0,0) and (0,0,1,0)
and (0,0,0,1). The BSP's are: (1,1,0,0,0) and (1,0,1,0,0) and
(1,0,0,1,0) and (1,0,0,0,1). The tree is now defined as:

[tree diagram]



(g) The tree traversal ends. All nodes which were defined
have been visited.

By visiting six nodes in a subtree (out of a possible 63 nodes
in the complete tree), seven basic solutions were found.

Figure 6-4 shows that there are 19 binary solution vectors
to the inequality. These solution vectors are clustered in families
to the right of the basic solution path. There are an average of
19/7 = 2.7 solution vectors per family for this problem. Fig. 6-15B
illustrates the conversion of the basic solutions to canonical solution
families. All trailing 0's are changed to (-) to indicate arbitrary
(0/1) variables. Fig. 6-15C shows the transformation of the canonical
solution families back to the general form. This transformation takes
place in two steps. First is the inverse permutation. Next, the
complemented variables are accounted for.
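
The counts quoted above (seven basic solutions, 19 solution vectors) can be checked with a short Python sketch. It is not the TPA: the node classification of Fig. 6-6 is replaced by a plain pruned pre-order search that visits more nodes than the six listed above, and "basic solution" is taken, as an assumption consistent with the seven BSP's of this example, to mean a solution that fails as soon as its last 1 is set to 0. All function names are hypothetical.

```python
from itertools import product


def basic_solutions(c, d):
    """Pruned pre-order search for the basic solutions of sum(c[j]*x[j]) >= d.

    Assumes c is positive and sorted in non-increasing order (canonical form).
    """
    n = len(c)
    tail = [0] * (n + 1)                   # tail[k] = c[k] + ... + c[n-1]
    for k in range(n - 1, -1, -1):
        tail[k] = tail[k + 1] + c[k]
    found, visits = [], 0

    def visit(prefix, rhs):
        nonlocal visits
        visits += 1
        k = len(prefix)
        if rhs <= 0:                       # the fixed prefix already satisfies the PIN
            found.append(tuple(prefix) + (0,) * (n - k))
            return
        if tail[k] < rhs:                  # free variables cannot reach rhs: prune
            return
        visit(prefix + [0], rhs)           # left branch, next variable = 0
        visit(prefix + [1], rhs - c[k])    # right branch, next variable = 1

    visit([], d)
    return found, visits


def family(basic):
    """Change trailing 0's to '-' (arbitrary 0/1 variables)."""
    last_one = max((i for i, v in enumerate(basic) if v == 1), default=-1)
    return basic[:last_one + 1] + ('-',) * (len(basic) - last_one - 1)


def members(fam):
    """Enumerate every binary vector covered by a solution family."""
    free = [i for i, v in enumerate(fam) if v == '-']
    for bits in product((0, 1), repeat=len(free)):
        x = list(fam)
        for i, b in zip(free, bits):
            x[i] = b
        yield tuple(x)


def u_range(fam, c):
    """Smallest and largest value of sum(c[j]*x[j]) over a family."""
    low = sum(cj for cj, v in zip(c, fam) if v == 1)
    high = low + sum(cj for cj, v in zip(c, fam) if v == '-')
    return low, high


c, d = [5, 3, 2, 2, 1], 6
basics, visits = basic_solutions(c, d)
families = [family(b) for b in basics]
solutions = [x for fam in families for x in members(fam)]
print(len(basics), len(solutions), len(set(solutions)))   # 7 19 19
print(u_range(family((1, 1, 0, 0, 0)), c))                 # (8, 13)
```

The families come out mutually disjoint, so the 19 vectors are counted once each.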


6.73 Miscellaneous


Fig. 6-16 is an enumeration of the transformation from canonical
form solutions to general form solutions for the example problem
of Fig. 6-4. In the left column of Fig. 6-16 are the 32 binary vectors
x_k = (x_1,...,x_5). In the right column are the corresponding
transformed binary vectors z_k = (z_1,...,z_5). The families of solutions
indicated on Fig. 6-15B and 6-15C are shown grouped in Fig. 6-16.

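Using Fig. 6-16, the following items can be noted.

(A) For every vector in the enumeration,

$$\sum_{j=1}^{5} c_j x_j = \sum_{j=1}^{5} \gamma_j z_j + g, \qquad \text{where } g = -\sum_{(\gamma_j < 0)} \gamma_j = 6.$$

This is an illustration of result (6-18). By (6-19) the transformation
is order preserving.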



FIGURE 6-16

EXAMPLE PROBLEM SHOWING ENUMERATION OF SOLUTION VECTORS BEFORE
AND AFTER TRANSFORMATION

 Combination   Variable (x_j)   Sum of     Family    Variable (z_j)   Sum of          Family
 number k      1 2 3 4 5        c_j x_j              1 2 3 4 5        gamma_j z_j

    32         1 1 1 1 1         13*       F4(x)     0 1 0 0 1          7*            F4(z)
    31         1 1 1 1 0         12*       F4(x)     0 1 0 1 1          6*            F4(z)
    30         1 1 1 0 1         11*       F4(x)     0 1 0 0 0          5*            F4(z)
    29         1 1 1 0 0         10*       F4(x)     0 1 0 1 0          4*            F4(z)
    28         1 1 0 1 1         11*       F4(x)     1 1 0 0 1          5*            F4(z)
    27         1 1 0 1 0         10*       F4(x)     1 1 0 1 1          4*            F4(z)
    26         1 1 0 0 1          9*       F4(x)     1 1 0 0 0          3*            F4(z)
    25         1 1 0 0 0          8*       F4(x)     1 1 0 1 0          2*            F4(z)
    24         1 0 1 1 1         10*       F5(x)     0 1 1 0 1          4*            F5(z)
    23         1 0 1 1 0          9*       F5(x)     0 1 1 1 1          3*            F5(z)
    22         1 0 1 0 1          8*       F5(x)     0 1 1 0 0          2*            F5(z)
    21         1 0 1 0 0          7*       F5(x)     0 1 1 1 0          1*            F5(z)
    20         1 0 0 1 1          8*       F6(x)     1 1 1 0 1          2*            F6(z)
    19         1 0 0 1 0          7*       F6(x)     1 1 1 1 1          1*            F6(z)
    18         1 0 0 0 1          6*       F7(x)     1 1 1 0 0          0*            F7(z)
    17         1 0 0 0 0          5                  1 1 1 1 0         -1
    16         0 1 1 1 1          8*       F2(x)     0 0 0 0 1          2*            F2(z)
    15         0 1 1 1 0          7*       F2(x)     0 0 0 1 1          1*            F2(z)
    14         0 1 1 0 1          6*       F3(x)     0 0 0 0 0          0*            F3(z)
    13         0 1 1 0 0          5                  0 0 0 1 0         -1
    12         0 1 0 1 1          6*       F1(x)     1 0 0 0 1          0*            F1(z)
    11         0 1 0 1 0          5                  1 0 0 1 1         -1
    10         0 1 0 0 1          4                  1 0 0 0 0         -2
     9         0 1 0 0 0          3                  1 0 0 1 0         -3
     8         0 0 1 1 1          5                  0 0 1 0 1         -1
     7         0 0 1 1 0          4                  0 0 1 1 1         -2
     6         0 0 1 0 1          3                  0 0 1 0 0         -3
     5         0 0 1 0 0          2                  0 0 1 1 0         -4
     4         0 0 0 1 1          3                  1 0 1 0 1         -3
     3         0 0 0 1 0          2                  1 0 1 1 1         -4
     2         0 0 0 0 1          1                  1 0 1 0 0         -5
     1         0 0 0 0 0          0                  1 0 1 1 0         -6

 * = solution vector

 Canonical form:  5X1 + 3X2 + 2X3 + 2X4 + X5 >= 6
 General form:    -2Z1 - 3Z3 + 5Z2 - Z4 + 2Z5 >= 0

NOTE: $\sum_{j=1}^{5} c_j x_j - \sum_{j=1}^{5} \gamma_j z_j = -\sum_{(\gamma_j < 0)} \gamma_j = g = 2 + 3 + 1 = 6$

(B) The canonical form vectors are shown in standard order,
but note that the relation

$$u_k = \sum_{j=1}^{5} c_j x_{kj} \ge \sum_{j=1}^{5} c_j x_{ij} = u_i$$

does not hold for every k > i, which shows that the sequence u_k,
k=1,2,...,32 is not monotonically increasing. For example, u_25 < u_24.
Note also that u_12 = 6, so x_12 is a solution vector. However, x_13 and
x_17 are not solutions.

Now consider an enumeration scheme to determine all x_k such that
u_k >= w, where w is a given constant. Sequentially select binary
vectors x_k starting at the top of the list (k = 32), form u_k and
work downward until u_k < w. This scheme will not guarantee that all
u_k >= w have been found. It is not an acceptable alternative to the
TPA.

(C) Associated with each family of solutions is a range of
values

$$a \le u(F_j) \le b$$

instead of the single value associated with an individual solution
vector. Even though two families are disjoint, their ranges of u(F_j)
may be overlapping. For example, using Fig. 6-16, 8 <= u(F_4) <= 13 and
7 <= u(F_5) <= 10. It can be seen that if the range of a canonical
family F_j(x) is u_a <= u(F_j(x)) <= u_b, the range of the corresponding
general form solution family F_j(z) is given by

$$u_a - g \le u(F_j(z)) \le u_b - g, \qquad \text{where } g = -\sum_{(\gamma_j < 0)} \gamma_j.$$

This is a convenient computational result.
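
Items (A) and (C) can be checked numerically. The sketch below assumes that the general form coefficients of z_1,...,z_5 are (-2, 5, -3, -1, 2), which is the reading consistent with the enumeration of Fig. 6-16, and that ties between equal coefficient magnitudes are broken by variable index; the function names are hypothetical. It complements the variables that carry negative coefficients, permutes to non-increasing order, and confirms that the two sums differ by the constant g for all 32 vectors.

```python
from itertools import product

gamma = [-2, 5, -3, -1, 2]   # assumed coefficients of z1..z5
rhs = 0                      # right hand side of the general form inequality


def to_canonical(gamma, rhs):
    """Return (order, c, d, g) for the canonical form of sum(gamma*z) >= rhs."""
    # Stable sort by decreasing magnitude; order[pos] is the z index behind x[pos].
    order = sorted(range(len(gamma)), key=lambda j: -abs(gamma[j]))
    c = [abs(gamma[j]) for j in order]
    g = -sum(v for v in gamma if v < 0)
    return order, c, rhs + g, g


def x_to_z(x, gamma, order):
    """Inverse permutation, then undo the complementing of variables."""
    z = [0] * len(gamma)
    for pos, j in enumerate(order):
        z[j] = x[pos] if gamma[j] > 0 else 1 - x[pos]
    return tuple(z)


order, c, d, g = to_canonical(gamma, rhs)
print(c, d, g)   # [5, 3, 2, 2, 1] 6 6

n_solutions = 0
for x in product((0, 1), repeat=len(c)):
    z = x_to_z(x, gamma, order)
    u = sum(cj * xj for cj, xj in zip(c, x))
    v = sum(gj * zj for gj, zj in zip(gamma, z))
    assert u == v + g                 # item (A): the sums differ by g
    n_solutions += (v >= rhs)
print(n_solutions)   # 19

# Item (C): the canonical range 8..13 of F_4 maps to 8-g..13-g in general form.
canonical_range = (8, 13)
general_range = (canonical_range[0] - g, canonical_range[1] - g)
print(general_range)   # (2, 7)
```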


6.74 The Use of Solution Families in a Document Retrieval System


We can identify each binary solution vector z_k with a particular
combination of index terms. Each z_k may have m documents
associated with it, m=0,1,2,..., and each of these m documents is
predicted to be relevant.

A family of solutions F_j(z) specifies a group of index term
combinations which has relevant associated documents. The BRS
consisting of the union of all the solution families will retrieve all
documents from the file which are predicted relevant (have a utility
u at or above the user specified threshold).

When the solution families F_j(z) are considered with respect
to a document retrieval system, several observations can be made about
the usefulness of a BRS as derived from the LPBI.

(a) The BRS which consists of the union of F_j(z) has the
same exact form as the heuristically generated BRS which is the
man-machine link in many existing systems. This provides a model with
analytical end results which parallels the end results of a human being
in current systems.

(b) The solution families F_j(z) are mutually disjoint. This
means that the BRS derived from the union of the F_j(z) will never
retrieve any document more than once. The BRS which is heuristically
generated cannot be guaranteed to have this property.

(c) The cost of searching the file using a family of solutions
F_j(z) is much less than the cost would be if an equivalent search were
run using each member z_k of the family separately.

(d) A disadvantage of searches made using solution families is
that any documents retrieved by a solution family F_j can have a
predicted utility spread over the range a <= u(F_j) <= b, and the predicted
utility of a given retrieved document can be obtained exactly only with
increased computation. The individual document utilities may be desired
when a large number L of documents are cited as being relevant
(predicted) by a BRS. The user may not have enough time to review all
retrieved documents and may want only the subset of documents having the
highest predicted utility. In this case the utility u_k can be
determined for each document in the retrieved set by using the index term
weights. The set of L documents can then be ranked and the H documents
with the highest predicted utility presented to the user. In
this case enumeration of document utilities is restricted to only the
set of those predicted relevant, and this is usually a very small
subset of the entire file.
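
The ranking described in item (d) can be sketched as follows. Everything specific here is an assumption made for illustration: the document identifiers, the binary index term vectors and the function name `rank_by_utility` are hypothetical, and the weights are simply the general form coefficients of the example in section 6.722.

```python
def rank_by_utility(docs, weights, top_h):
    """Score each retrieved document with the linear utility function and
    return the top_h documents, highest predicted utility first."""
    scored = [
        (sum(w * t for w, t in zip(weights, terms)), doc_id)
        for doc_id, terms in docs.items()
    ]
    # Sort by decreasing utility; ties are broken by document identifier.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_h]


weights = [-2, 5, -3, -1, 2]          # assumed index term weights for z1..z5
retrieved = {                         # hypothetical retrieved set (L = 4)
    "D1": (0, 1, 0, 0, 1),
    "D2": (0, 1, 0, 1, 0),
    "D3": (1, 1, 0, 0, 1),
    "D4": (0, 1, 0, 0, 0),
}
print(rank_by_utility(retrieved, weights, top_h=2))   # [(7, 'D1'), (5, 'D3')]
```

Only the retrieved set is scored, which is the point made in the text: the utilities of documents outside the predicted relevant set never need to be enumerated.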


6.75 Computer Implementation of the TPA

A computer program for solution of the LPBI using the TPA has
been written in Fortran IV for the IBM 7094/7040 Direct Couple System.
Four subroutines control the solution of the LPBI and the output
of data.

(a) The first subroutine forms the LPBI from the LUPF and
converts all LPBI coefficients to integers. The LUPF as passed to this
subroutine has real coefficients and complemented variables
(a. = 1 - for all j). This subroutine converts all coefficients to
integers by scaling and truncating. Accuracy of the conversion process is
variable and is set by program parameters.

(b) The second subroutine transforms the integer LPBI
parameters to canonical form, and finds all basic solutions to the
canonical form. The basic solutions are written in groups of fixed size to
an output device for temporary storage.

(c) The third subroutine produces canonical solution families
from the basic solutions, and transforms the canonical solution
families to get solution families to the general form LPBI. Basic
solutions are read from the storage device to core in groups, are
converted to general form solution families in core and then are again
stored in groups on the output device. The range of u(F_j) and the size
S(F_j) = |F_j| are also recorded with each solution family.

(d) The fourth subroutine will output solution families to a
printer or other device and, if desired, will screen solution
families on the basis of range or size and suppress printing of certain
solution families.

All four subroutines are under the exclusive control of a
driver program. (No subroutine calls any other subroutine.) The
resulting modular system is convenient to use and modify.

6.76 Computational Experience with the TPA

Experience with the TPA has been rather limited. Table 6-1
gives some performance data for 14 sample problems. The largest
problem solved had only 14 variables.

By using these sample problems and by making some assumptions
which seem reasonable based on the data of Table 6-1, rough estimates
were obtained for larger problems. The assumptions are listed below.

(a) The number NV of nodes visited during solution of a
LPBI increases exponentially with the number of problem variables n.

$$NV = A_0 e^{\gamma_0 n} \tag{6-24}$$

Parameters A_0 and \gamma_0 are experimentally determined constants.

(b) The number of basic solutions identified is proportional
to the number of nodes visited.

$$BSOL = \alpha_2 NV \tag{6-25}$$

(d) A constant fraction \alpha_3 of all basic solutions will be
degenerate and (1 - \alpha_3) will be non-degenerate.

$$BSOL = DBSOL + NDBSOL, \qquad DBSOL = \alpha_3 BSOL, \qquad NDBSOL = (1 - \alpha_3) BSOL \tag{6-27}$$

(e) The number of solution points in a non-degenerate
solution family is an increasing function of n, FS(n). The
analytical form of this function can be derived from assumptions (a) - (d)
above as follows.

For the total number of solutions we can write two equivalent
expressions:

$$TS = \alpha_1 e^{0.693 n} \tag{6-26}$$

$$TS = DBSOL + (NDBSOL)\,FS(n) = \alpha_2 \alpha_3 A_0 e^{\gamma_0 n} + \alpha_2 (1 - \alpha_3) A_0 e^{\gamma_0 n}\,FS(n) \tag{6-28}$$

Equating (6-26) to (6-28) and solving for FS(n) gives

$$FS(n) = \frac{\alpha_1 e^{0.693 n} - \alpha_2 \alpha_3 A_0 e^{\gamma_0 n}}{\alpha_2 (1 - \alpha_3) A_0 e^{\gamma_0 n}} \tag{6-29}$$


TABLE 6-2

SMOOTHED AND EXTRAPOLATED ESTIMATES OF TPA PERFORMANCE

A. Estimated Solution Time as a Function of Problem Size

 Number of    Expected       Total node visiting time (sec) at node
 variables    node visits    visiting rates (R) shown below (nodes/sec)
     n           NV          R = 500        R = 1000        R = 2000

    10              85          0.17          0.0850          0.0425
    15           1,210          2.42          1.21            0.605
    20          17,800         35.60         17.3             8.65
    25         250,000        500.          250.            125.0
    30       3,627,000       7260.         3630.           1815.
                             (121 min)     (60.5 min)      (30.25 min)

B. Number and Type of Solutions as a Function of Problem Size

 Number of   Number of   Nondegenerate   Degenerate   Average    Total solutions   Total number
 variables   basic       basic           basic        sols/fam   in families       of solutions
             solutions   solutions       solutions
     n       BSOL        NDBSOL          DBSOL        FS(n)      (NDBSOL)FS(n)     TS

    10              98             59           39      10.3      6.08x10^2        6.47x10^2
    15           1,400            840          560      23.4      1.97x10^4        2.03x10^4
    20          20,600         12,350        8,250      52.4      6.47x10^5        6.55x10^5
    25         289,000        173,500      115,500     117.6      2.03x10^7        2.04x10^7
    30       4,200,000      2,520,000    1,700,000     260.3      6.56x10^8        6.58x10^8

From the data of Table 6-1, estimates of the parameters are:

$$\alpha_1 = 0.622, \quad \alpha_2 = 1.155, \quad \alpha_3 = 0.400, \quad A_0 = 0.403, \quad \gamma_0 = 0.5345 \tag{6-30}$$

and it follows that (6-29) then becomes:

$$FS(n) = 2.23\,e^{0.1585 n} - 0.67. \tag{6-31}$$

The results of applying the above assumptions (6-24) to (6-31)
for selected values of n are shown in Table 6-2B. If one computer
word is used to store each basic solution, it appears that the storage
problem for the 25 variable problem is excessive, with 289,000 basic
solutions expected. The 20 variable problem appears more reasonable,
with 20,000 basic solutions expected.
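
The extrapolation can be reproduced with a few lines of Python. This is a sketch of the model as reconstructed here, not the program used for Table 6-2: the function names are hypothetical, the proportionality BSOL = 1.155 NV is the relation implied by assumption (b) and by the rows of Table 6-2B, and small differences from the tabulated values are to be expected from rounding.

```python
import math

A0, GAMMA0 = 0.403, 0.5345                       # parameters of (6-24)
ALPHA1, ALPHA2, ALPHA3 = 0.622, 1.155, 0.400     # remaining estimates in (6-30)


def nodes_visited(n):
    return A0 * math.exp(GAMMA0 * n)             # NV, eq. (6-24)


def basic_solution_count(n):
    return ALPHA2 * nodes_visited(n)             # BSOL, assumption (b)


def total_solutions(n):
    return ALPHA1 * math.exp(0.693 * n)          # TS, eq. (6-26)


def family_size(n):
    """Average number of solutions per non-degenerate family, eq. (6-29)."""
    bsol = basic_solution_count(n)
    return (total_solutions(n) - ALPHA3 * bsol) / ((1 - ALPHA3) * bsol)


for n in (10, 15, 20, 25, 30):
    nv = nodes_visited(n)
    print(f"n={n:2d}  NV={nv:12.0f}  BSOL={basic_solution_count(n):12.0f}  "
          f"FS={family_size(n):6.1f}  time at 1000 nodes/sec={nv / 1000:9.2f} s")
```

Because both NV and TS grow exponentially in n, the ratio in (6-29) collapses to a single exponential minus a constant, which is the form of (6-31).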

Table 6-2A shows the expected processing time based on three
different average node visiting rates. The current node visiting rate
is about 500 nodes/second. With some very trivial program
modifications, this can be extended to 1000 nodes/second or above. The 25
variable problem at 1000 nodes/second will require an estimated 250
seconds for solution. This is considered excessive, and the 20
variable problem again appears more reasonable with a 17.3 second
total.

† The reader is cautioned that the variances of the parameter
estimates are quite large. Smoothed and extrapolated data based on
these parameters is intended for rough estimates only. Data is also
peculiar to the application here, where index term weights are derived
using approximation theory.

Times given are for the TPA which finds basic solutions to the
canonical form. Subroutines which transform parameters to canonical
form and which transform basic solutions to general solution families
require much less time than the TPA. Their contribution to total
processing time is ignored here.

In conclusion, the TPA appears adequate for solving LPBI's
with up to 20 variables. For the 20 variable problem, the expected
processing time is 17.3 seconds (at a node visiting rate of 1000
nodes/second). For the same problem, expected storage space is 20,600
words, assuming one basic solution per word. Both solution time and
storage requirements appear reasonable for applications related to
document retrieval systems.

7.0 EXPERIMENT DESIGN AND PRESENTATION OF DATA

This chapter discusses the experiment design configuration
selected for test purposes and presents the raw response data. Test
objectives and the various measures of search effectiveness are also
discussed. Analysis of the experimental data is deferred to chapter
8.0.


7.1 Test Objectives


The test program had three objectives.

(a) First, to determine whether significant differences in
search effectiveness exist between searches performed using
machine-generated BRS's and searches using BRS's generated heuristically by
humans.

(b) Secondly, to help determine the causes of these
differences, if they exist.

(c) Finally, to provide an overview of the whole process and
suggest areas for further research.

Before presenting test details, it is convenient to discuss
figures of merit used to evaluate the effectiveness of document
retrieval systems.

7.2 Measuring Search Effectiveness

Three measures of effectiveness are used here to evaluate test
results. All are based on entries in the following 2x2 contingency
table.

                    Retrieved    Not Retrieved

   Relevant           n_11           n_12          n_1.

   Not Relevant       n_21           n_22          n_2.

                      n_.1           n_.2          n          (7-1)

For each search, a contingency table identical to (7-1) can be
constructed. This assumes that all relevant documents are known, whether
retrieved or not.

7.21 Recall and Precision


Two standard measures of search effectiveness based on the
contingency table are recall and precision. These measures have been
proposed and used by several authors.(94, 95)

The definitions are:

$$\text{Recall} = \frac{\text{relevant retrieved}}{\text{total relevant}} \tag{7-2}$$

$$\text{Precision} = \frac{\text{relevant retrieved}}{\text{total retrieved}} \tag{7-3}$$

Roughly, recall is a measure of how well the system retrieves
all the relevant material, while precision is a measure of the economy
of the retrieval process. Variations of the above definitions of
recall and precision are occasionally used. See, for example, Salton.(96)
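
Definitions (7-2) and (7-3) translate directly into code. In the sketch below the function name and the cell counts are hypothetical; the argument names follow the row and column layout of the contingency table (first index relevance, second index retrieval), which is an assumption about the notation of (7-1).

```python
def recall_precision(n11, n12, n21, n22):
    """Recall and precision from the cells of a 2x2 contingency table.

    n11: relevant and retrieved        n12: relevant, not retrieved
    n21: not relevant but retrieved    n22: not relevant, not retrieved
    Assumes at least one relevant and at least one retrieved document.
    """
    recall = n11 / (n11 + n12)        # relevant retrieved / total relevant
    precision = n11 / (n11 + n21)     # relevant retrieved / total retrieved
    return recall, precision


# Hypothetical search result on a file of 4881 documents.
recall, precision = recall_precision(n11=30, n12=10, n21=20, n22=4821)
print(recall, precision)   # 0.75 0.6
```

Note that n22 does not enter either measure, which is one reason a statistic using all four cells is considered next.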

7.22 The Information Statistic as a Measure of Search Effectiveness

A disadvantage of recall and precision is that a pair of
numbers are involved instead of a single figure of merit. An alternate
measure based on the 2x2 contingency table has been proposed and used
by A. R. Meetham,(97) which gives a single figure of merit for the
search effectiveness.

It is identical to the information measure $\hat{R}$ described in
chapter 4:

$$\hat{R} = H(X) - H(X/Y) \tag{4-8}$$

This computational formula was derived in section 4.25.

Recall that $\hat{R}$ is the gain in information (reduction in
entropy) which occurs (on the average) each time new information p(Y)
is used to convert a prior distribution p(X) to a posterior
distribution p(X/y). The prior distribution p(X) is an initial assignment
of probabilities to states of nature and the posterior distribution
p(X/y) is the revised probability distribution after observing
auxiliary data, or the results of an experiment y. (See chapter 4.0.)
The information measure $\hat{R}$ is used in chapter 4 to select the (most
discriminating) index terms for inclusion in the decision function. $\hat{R}$
is used here to evaluate document retrieval system effectiveness. This
allows a new view of the retrieval process as a prior to posterior
probability distribution adjustment. The prior distribution is the
probability of a document in the file being relevant, given that it is
drawn at random, and with no knowledge of index terms etc., which are
associated with the document. The posterior distribution is the
probability that a document which is selected by the retrieval system is
relevant. (This selection is based on the index terms.)

The retrieval system can be viewed as an automatic processor
which performs an auxiliary experiment on the index terms associated
with a document and then, by using a built-in decision rule on these
experimental results, offers a suggestion to the user as to whether the
document is relevant or not. After seeing the document the user makes
a final decision about its relevance. The degree of agreement which
exists between the judgements made by the retrieval system and the user
is the measure $\hat{R}$ of how well the system operates.

A perfect retrieval system would make decisions (suggestions)
about document relevance which would always agree with the user
judgement. The system suggestion would then remove all uncertainty (for the
user) about document relevance. In this case $\hat{R} = H(X)$. Any real
system of course will not be perfect. As a consequence we will have

$$0 < \hat{R} < H(X).$$



Define:

$$a = 100\,[\hat{R}/H(X)]. \tag{7-4}$$

Then we have:

$$0 < a < 100.$$

The variable a is the normalized information statistic (NIS) and is
interpreted as the percent effectiveness of the retrieval system. It
can be thought of as the average percent reduction in uncertainty about
document relevance, if the system suggestion regarding document
relevance is followed. The measure (7-4) will be used in the experiment
described here to evaluate the retrieval system, in addition to recall
(7-2) and precision (7-3).

The relation of the NIS to recall and precision is shown in
Fig. 7-1, for a file similar to the one used for test purposes. It
can be seen that recall and precision are both strictly increasing
functions of the NIS. Thus, increasing the NIS will never degrade
either recall or precision.
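
A minimal sketch of the NIS computation from a 2x2 contingency table is given below. It assumes that the probabilities are estimated by relative cell frequencies and that entropies are taken to base 2; the computational formula of section 4.25 may be arranged differently, and since the NIS is a ratio the base of the logarithm cancels. The function names are hypothetical.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution; zero terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def normalized_information_statistic(n11, n12, n21, n22):
    """NIS = 100 * [H(X) - H(X/Y)] / H(X), with X the user's relevance
    judgement and Y the system's retrieve / do not retrieve decision.
    Assumes the file contains both relevant and non-relevant documents."""
    n = n11 + n12 + n21 + n22
    h_x = entropy([(n11 + n12) / n, (n21 + n22) / n])
    h_x_given_y = 0.0
    for relevant, not_relevant in ((n11, n21), (n12, n22)):   # retrieved, not retrieved
        column = relevant + not_relevant
        if column:
            h_x_given_y += (column / n) * entropy([relevant / column, not_relevant / column])
    return 100.0 * (h_x - h_x_given_y) / h_x


print(normalized_information_statistic(40, 0, 0, 60))     # perfect agreement
print(normalized_information_statistic(10, 10, 10, 10))   # retrieval unrelated to relevance
```

The two calls illustrate the end points of the scale: complete agreement between system and user removes all uncertainty, while a retrieval decision that is independent of relevance removes none.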

7.23 Other Applications of the Information Statistic

The NIS as described here was used by Shirey(98) to evaluate
the efficiency of document abstracts and first - last paragraph
combinations at predicting document relevance. After reading these
relevance cue indicators, the users were asked to make a judgment about
the relevance of the full document. After this first judgment was
obtained, the users were shown the full document and asked for a second
final opinion of relevance. The preliminary and final results were
analyzed and $\hat{R}/H(X)$ was computed. In this application the use of
relevance cue indicators constitutes an experiment performed to provide
more information about document relevance.

R. H. Shumway(99) has also noted the potential use of the
information statistic $\hat{R}$ as an overall measure of retrieval system
effectiveness. In addition, he re-analyzes the Shirey data assuming a
three-way table relation. He demonstrates that the two-way table used
by Shirey for his analysis is really a special case of multi-way
contingency tables. These can be analyzed using an information measure
which is partitioned in a manner similar to the sum of squares in the
analysis of variance. The general method is treated by Kullback.


7.24 Summary

The information statistic $\hat{R}$ described above was developed in
chapter 4 for selection of index terms (a form of feature extraction).
It is used again here in its normalized form (7-4) as a figure of merit
for evaluating retrieval systems.

It has been both used and proposed by others for extraction of
pattern features (see section 1.4), evaluation of search effectiveness,
evaluation of relevance cue indicators, and general contingency table
analysis.

7.3 Experiment Design


7.31 General 

A $2^3$ factorial experiment was designed to determine whether
retrieval system effectiveness is influenced by:

(a) methods (BRS's generated by machine vs. BRS's generated by
people);

(b) number of index terms used in the model (a high level of about 15
terms and a low level of about 5 terms); and

(c) number of documents in the training set (50 documents at a high
level and 25 documents at a low level).

The $2^3$ factorial configuration was replicated four times, with
each replication (of $2^3 = 8$ points) being a separate query to the
system. This allowed variability existing between questions to be
accounted for.

One month of the NASA file (a total file size of 4881 documents)
was searched using the different BRS's. All the documents relevant to
the four queries were identified before the searches were performed.
The figures of merit for each search were then computed from the 2x2
contingency tables (7-1) constructed after completion of the searches.
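
The layout of the experiment can be enumerated mechanically. The
following is a minimal sketch, not part of the original system; the
level labels ("analyst", "machine") and the function name are
illustrative assumptions. It lists the 32 searches in the order
questions, methods, training set size, number of index terms.

```python
from itertools import product

# Factor levels from section 7.31 (labels are illustrative assumptions).
QUESTIONS = (1, 2, 3, 4)              # replications (blocks), factor Q
METHODS = ("analyst", "machine")      # factor M
TRAINING_DOCS = (25, 50)              # factor D
INDEX_TERMS = (5, 15)                 # factor T, nominal levels


def design_points():
    """Return every (question, method, docs, terms) combination."""
    return [
        {"question": q, "method": m, "docs": d, "terms": t}
        for q, m, d, t in product(QUESTIONS, METHODS, TRAINING_DOCS, INDEX_TERMS)
    ]


runs = design_points()
print(len(runs))   # 4 questions x 2^3 treatment combinations = 32
print(runs[7])     # question 1, machine method, 50 documents, 15 terms
```

With this ordering, the eighth entry is the combination that the text
later calls data point 8.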


7.32 Selection and Preparation of Test Questions


The four questions used as replicates were selected at random
from a group of actual queries in an information system. Each question
selected had an existing associated group of abstracts rated relevant
or non-relevant by a user. There were enough existing abstracts to
construct a 50 document training set.

Before the training set was finalized, the month to be searched
(March, 1969) was chosen at random, and all abstracts in the training
set for this month were removed. The 50 abstracts remaining in the
training set were from within six months before and after the search
month of March, 1969.

By using the training set abstracts, a detailed question
description was written. Nine meaningful and identifiable subcategories
for each question were devised, and each subcategory was assigned a
utility from 1 to 9. Each of the 50 abstracts was then placed in one
of the nine subcategories, and a utility threshold $\tau$ was introduced
which designated which of the subcategories were relevant and which
were not. With the questions well defined by the training sets and the
written descriptions, a complete manual search was performed over the
March, 1969 portion of the file and all relevant documents for each
query were identified.

A 25-document training set was created for each question by
selecting 25 documents from each 50-document training set. (The 50
documents were ordered sequentially by their file numbers, and then every
other number was chosen. Since file numbers are unrelated to utilities,
this selection method is believed unbiased.) This gave eight training
sets, one of 50 and one of 25 documents for each of the four test
questions. Preparations for testing were completed by assembling a
'package' for each of the eight training sets. This package consisted of:

(a) a sequential listing of all document numbers, the utility assigned
to each, and the set of associated index terms;

(b) the utility threshold $\tau$ defining relevance;

(c) full abstracts of each training set document, grouped by utility
sub-category, with each group also marked as being relevant or
non-relevant;

(d) one-sentence abstracts of each training set document, grouped and
marked as in (c) above.
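
The every-other-document selection used to form the 25-document
training sets can be sketched as follows. The file numbers are
hypothetical, and starting with the first document in the ordering
(rather than the second) is an assumption; the text does not say which
was used.

```python
def halve_training_set(file_numbers):
    """Order documents by file number and keep every other one.

    Assumption: selection starts with the lowest file number.
    """
    ordered = sorted(file_numbers)
    return ordered[::2]


# Hypothetical file numbers for a 50-document training set.
full_set = list(range(1001, 1051))
half_set = halve_training_set(full_set)
print(len(half_set))   # 25
```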

The 32 experimental BRS's were next derived using the above
training sets. For each of the eight training sets two BRS's were
constructed, one using five index terms and the other fifteen. This was
repeated for two methods of BRS construction (machine and analyst) to
give a total of 32 combinations.

The machine generated BRS's (16 of them) were constructed using
the methods described in previous chapters. First, best single index
terms were selected. Next, the LUPF was fit to the assigned document
utilities. Finally, using the utility threshold, the LPBI was formed
and all solution families were found. Only items (a) and (b) in the
training set packages were utilized by the machine system.

Another 16 BRS's were constructed heuristically by four
experienced information analysts. Each analyst was assigned one particular
combination of training set size (25 or 50) and number of index terms
(5 or 15) for each question. There are four such combinations per
question, one per analyst. Each analyst was assigned only once to
each of the four combinations.

The analyst was then requested to construct a BRS for this
particular combination. The effect of different analysts is considered to
be an integral part of the subjective method (method 1). The analysts
utilized items (a), (b), (c) and (d) of the training set packages.
They were not, however, given the question description. They were
required to infer the question meaning by reading the abstracts for
relevant and non-relevant documents and by noting the utilities assigned
to each abstract. Each analyst was given a maximum of one hour to
write the BRS assigned to him.

Finally, the file was searched using each of the 32 BRS's.
Searches using the BRS's generated by the information analysts were
made with an existing computer program. The machine-generated BRS's
were not used directly as search instructions. Instead, the equivalent
sets of index term weights were used.

7.33 Classification of Variables in the Problem

It is convenient to place the problem variables into three
groups.

7.331 Independent Variables Controlled as Part of the Problem. This
includes Methods (M), where $M_1$ is the subjective method using human
analysts and $M_2$ is the machine method. This factor is fixed and
qualitative. Analysts appear implicitly as part of this factor.

Also controlled were the number of index terms (T) appearing
in the training set. $T_1$ refers to the lower level (about 5 terms)
and $T_2$ refers to the higher level of about 15 terms. This factor is
fixed and qualitative because the number of terms varied slightly, but
was identifiable at either a high or low level.

Documents in the training set (D) were run at two levels. A
lower level $D_1$ of 25 documents and an upper level $D_2$ of 50
documents was used. Factor D is fixed and quantitative.

Questions (Q) were run as the replication (or block) variable
to lower the error variance. The entire experiment is conveniently
classified as a $2^3$ factorial run in a randomized block design.
Four questions (replications or blocks) were used. Factor Q is
random and qualitative.

7.332 Variables Held Constant as Boundary Conditions on the Problem.
This includes the fraction of the training set which is relevant (about
50%), the time allowed each analyst to construct the BRS, and the
method of query presentation to the analyst. Other variables held
constant are the extent of file searched (one month, or 4881 documents)
and the particular time period of the file (March, 1969).

7.333 Uncontrolled Factors Contributing to the Error Variance. In this
group are the system indexing, compatibility of the question to the
system, and consistency of the question itself. Also, the variation
between analysts within method 1 contributes to error variance.

7.34 Factors and Variables Not Considered in the Experiment

The following important items were not considered in this
experimental program.

(a) The effect of adaptive refinement of the BRS through
additions and/or deletions from the training set, followed by repeated
searches, was not investigated. BRS's refined over several searches
by supplementing the training set would be expected to produce better
results than the BRS's used here.

(b) Experienced information analysts were used to construct
BRS's for test purposes instead of casual system users. The effect of
user experience was not investigated, but casual users would not be
expected to construct BRS's which would be as effective as those of
more experienced users.

(c) Most of the problems solved for index term weights ($M_2$)
exhibited alternate optimal solutions (see section 5.5). The retrieval
efficiency of these alternate optimal solutions was not investigated.
The initial optimal solution was always used for retrieval purposes.

7.35 The Model Equation and Expected Mean Squares Table

The model equation for the factorial experiment is given by:

$$y_{ijk\ell} = \mu + T_i + D_j + TD_{ij} + M_k + TM_{ik} + DM_{jk} + TDM_{ijk} + Q_\ell + e_{ijk\ell} \qquad (7\text{-}5)$$

where

$i = 1,2$ and i=1 denotes 5 index terms, i=2 denotes 15 index terms;

$j = 1,2$ and j=1 denotes 25 documents in the training set, j=2 denotes
50 documents in the training set;

$k = 1,2$ and k=1 denotes method 1 (analysts produce BRS), k=2 denotes
method 2 (machine produced BRS); and

$\ell = 1,2,3,4$ for 4 questions, each functioning as a replication.


All factors except Q are fixed. Q is a random factor. The
expected mean square table is shown below.

Factor             Fixed or    Degrees of    Expected
                   random      freedom       mean square

T (index terms)       F            1         $\sigma_e^2 + 16\sigma_T^2$
D (documents)         F            1         $\sigma_e^2 + 16\sigma_D^2$
TD                    F            1         $\sigma_e^2 + 8\sigma_{TD}^2$
M (methods)           F            1         $\sigma_e^2 + 16\sigma_M^2$
TM                    F            1         $\sigma_e^2 + 8\sigma_{TM}^2$
DM                    F            1         $\sigma_e^2 + 8\sigma_{DM}^2$
TDM                   F            1         $\sigma_e^2 + 4\sigma_{TDM}^2$
Q (questions)         R            3         $\sigma_e^2 + 8\sigma_Q^2$
error                 R           21         $\sigma_e^2$
                                                                  (7-6)

From (7-6), an exact F test exists for each of the effects using
the error mean square.
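
The degrees of freedom and the multipliers on the effect variances in
(7-6) follow from the layout alone. The sketch below is an
illustration under the usual balanced-design rule (the multiplier for
an effect is the number of observations averaged at each of its level
combinations); the function and variable names are assumptions, not
part of the original analysis.

```python
from itertools import combinations

FACTORS = ("T", "D", "M")   # fixed factors, two levels each
LEVELS = 2
BLOCKS = 4                  # questions (Q), the random blocking factor


def anova_layout():
    """Return {effect: (degrees of freedom, multiplier in the EMS)}."""
    n_total = BLOCKS * LEVELS ** len(FACTORS)          # 32 observations
    rows = {}
    for size in range(1, len(FACTORS) + 1):
        for combo in combinations(FACTORS, size):
            name = "".join(combo)
            df = (LEVELS - 1) ** size
            multiplier = n_total // LEVELS ** size     # observations per cell
            rows[name] = (df, multiplier)
    rows["Q"] = (BLOCKS - 1, n_total // BLOCKS)
    used = sum(df for df, _ in rows.values())
    rows["error"] = (n_total - 1 - used, 0)            # error has no extra term
    return rows


for effect, (df, multiplier) in anova_layout().items():
    print(effect, df, multiplier)
```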

7.36 Choice of Sample Size 


The sample size was determined by choosing acceptable risk
levels associated with the test for a difference between treatment
means for main effect M (methods). This test between means is
summarized by the following hypotheses:

$$H_0:\ |\mathrm{NIS}(M_1) - \mathrm{NIS}(M_2)| = 0 \qquad (7\text{-}7)$$

$$H_1:\ |\mathrm{NIS}(M_1) - \mathrm{NIS}(M_2)| > 0 \qquad (7\text{-}8)$$

Here $\mathrm{NIS}(M_1)$ and $\mathrm{NIS}(M_2)$ are the true mean values of the
normalized information statistic for method 1 (subjective) and method 2
(machine).

Estimates of the mean and variance of the NIS for a one month
search of the NASA file were first determined subjectively. These
estimates were 28.3 percent for the average NIS and $177 = \sigma_e^2$ for the
NIS variance.


The test statistic for a difference $\delta$ between NIS treatment
means is given (103) as:

$$t = \frac{[\overline{\mathrm{NIS}}(M_1) - \overline{\mathrm{NIS}}(M_2)] - \delta}{S_e \left( \frac{1}{r_1} + \frac{1}{r_2} \right)^{1/2}} \qquad (7\text{-}9)$$

which is distributed as Student's t with $\nu$ degrees of freedom,
where:

(a) $\overline{\mathrm{NIS}}(M_1)$, $\overline{\mathrm{NIS}}(M_2)$ are the treatment means (average NIS
responses for methods $M_1$ and $M_2$);

(b) $r_1$, $r_2$ are the number of replications in each treatment mean;

(c) $S_e^2$ is the error variance (of the NIS response) as estimated from
the experiment;

(d) $\delta$ is the true difference between the treatment means (difference
between NIS responses for methods $M_1$ and $M_2$); and

(e) $\nu$ is the number of degrees of freedom in $S_e^2$.

The null hypothesis (7-7) now becomes $\delta = 0$.

After some deliberation, it was decided that a true difference
$\delta$ = 10% between NIS response to the two different treatments would be
meaningful to retrieval system operation and should be detected by the
experiment. Also, the type I error (alpha) was fixed at 0.10. Because
the cost in both time and effort of experimentation is great, a
compromise was reached for four replications of the $2^3$ factorial, or 32
data points (searches). This gave $r_1 = r_2 = 16$, $\nu = 21$ and
$\sigma_e = 13.3$, which is the previous subjective estimate of the NIS
standard deviation.

An operating characteristic curve constructed for the t-test
(7-9) using this data is summarized below, where the type II error
(beta) of not detecting a true difference $\delta$ is given as a function of
$\delta$.

True difference $\delta$        Type II error
(NIS, percent)             (beta)

       0                    0.90
       2                     .81
       5                     .62
      10                     .23
      15                     .05
      20                     .01
                                                  (7-10)

In summary, for the chosen configuration, it can be seen that
if the true $\delta$ = 10, the probability of not detecting this difference
is 0.23 (the beta error). Alternately, there is a probability of 0.10
(the alpha error) of falsely detecting a significant difference, given
that there is none.
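
The shape of the operating characteristic curve can be checked with a
rough calculation. The sketch below is an approximation under stated
assumptions, not the procedure used to build (7-10): it replaces
Student's t with the normal distribution and treats the test as
one-sided, which is inferred from the tabulated beta of 0.90 at a true
difference of zero. Its values are therefore close to, but not
identical with, the tabulated ones.

```python
from math import sqrt
from statistics import NormalDist


def beta_error(delta, sigma=13.3, r1=16, r2=16, alpha=0.10):
    """Approximate probability of not detecting a true difference delta.

    Assumptions: one-sided test, normal approximation to Student's t.
    """
    std_err = sigma * sqrt(1.0 / r1 + 1.0 / r2)
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return NormalDist().cdf(z_crit - delta / std_err)


for delta in (0, 2, 5, 10, 15, 20):
    print(delta, round(beta_error(delta), 2))
```

The approximation gives a beta near 0.2 for a true difference of 10,
against the tabulated 0.23 obtained with 21 degrees of freedom.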

7.37 A Sub-Experiment to Determine the Effect of Analysts within
Method 1

Analysts are considered to be an integral part of method 1 for
the $2^3$ factorial experiment. However, when method 1 is considered
alone, it is meaningful to isolate the effects of the analysts.

To consider this effect, it was necessary to control the
arrangement of analysts, questions and treatments within method 1.
This was done with a latin square configuration. The model equation
is:

$$y_{ijk} = \mu + A_i + Q_j + T_k + e_{ijk} \qquad (7\text{-}11)$$

where $A_i$: i=1,2,3,4 are analysts;

$Q_j$: j=1,2,3,4 are questions;

and $T_k$: k=1,2,3,4 are treatment combinations.

Table 7-2A shows the particular latin square configuration chosen.
One combination shown in this table is query 1 ($Q_1$) with analyst 3
($A_3$) using treatment 4 ($T_4$), which consists of a training set of 50
documents and a BRS with 15 index terms.
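
The defining property of a latin square, that each treatment occurs
exactly once for every analyst and once for every question, is easy to
verify in code. The sketch below is illustrative only: the cyclic
square it builds is a hypothetical example, not the configuration used
in the experiment.

```python
def is_latin_square(square):
    """True if every symbol occurs once in each row and each column."""
    n = len(square)
    if n == 0 or any(len(row) != n for row in square):
        return False
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    if any(set(row) != symbols for row in square):
        return False
    return all({row[c] for row in square} == symbols for c in range(n))


def cyclic_square(n):
    """Hypothetical n x n layout: rows are analysts, columns are questions."""
    return [[(r + c) % n + 1 for c in range(n)] for r in range(n)]


layout = cyclic_square(4)
print(is_latin_square(layout))   # True
```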

7.4 Presentation of the Experimental Data


7.41 Factorial Experiment Response Data

Table 7-1 shows the response data from the 32 experimental
searches. Each response is given in terms of contingency table entries.
This is followed by the NIS, the recall and the precision of the search,
all of which are computed from the contingency table.

For example, consider data point 8 of table 7-1 (read across
line 8). This point corresponds to a search with a BRS formed using
nominally 15 index terms ($T_2$ = 15), from a training set with 50
documents ($D_2$ = 50), using method 2 ($M_2$ for a machine BRS), and
searching query 1 ($Q_1$). The corresponding 2x2 contingency table is
given by:


                  Relevant    Not relevant

Relevant              3             9            12
Not relevant         29          4840          4869

                     32          4849          4881
                                                        (7-12)


This table corresponds to (7-1). The NIS is computed from (7-12) using
the same methods presented in the example of Fig. 4-2, and described in
section 4.55. For search 8, the NIS is 10.15 (percent). The search
recall and precision are also computed from (7-12) by using (7-2) and
(7-3). These are given as 0.333 and 0.094, respectively.
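
The computation of the three figures of merit from a contingency table
can be sketched as follows. This is an illustration, not the original
program. It assumes the NIS is $100 \cdot [H(x) - H(x/y)]/H(x)$, with x
the true relevance and y the search outcome, following the definition
of $\hat{R}$ used for index terms in section 7.44; the exact form of (7-4) lies
outside this section. The counts used are those of data point 1 in
Table 7-1.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of the distribution given by a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def search_measures(n11, n12, n21, n22):
    """Return (recall, precision, NIS in percent) for one search.

    n11: relevant and retrieved       n12: relevant, not retrieved
    n21: non-relevant but retrieved   n22: non-relevant, not retrieved
    """
    n = n11 + n12 + n21 + n22
    relevant = n11 + n12
    retrieved = n11 + n21
    recall = n11 / relevant if relevant else 0.0
    precision = n11 / retrieved if retrieved else 0.0
    h_x = entropy([relevant, n - relevant])
    # Conditional entropy of relevance given the search outcome.
    h_x_given_y = (retrieved / n) * entropy([n11, n21]) \
        + ((n - retrieved) / n) * entropy([n12, n22])
    nis = 100.0 * (h_x - h_x_given_y) / h_x if h_x > 0 else 0.0
    return recall, precision, nis


recall, precision, nis = search_measures(3, 9, 19, 4850)
print(round(recall, 3), round(precision, 3), round(nis, 2))
```

For these counts the sketch gives a recall of 0.250 and a precision of
0.136, and an NIS within rounding of the 11.54 listed in Table 7-1.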

TABLE 7-1. - RESPONSE DATA FROM FACTORIAL EXPERIMENT

Data     Treatment           Contingency table               Response
point    combination
         Q   M   D    T      n11   n12   n21    n22       NIS    Recall  Precision

  1      1   1   25    5       3     9    19   4850      11.54   0.250    0.136
  2      1   1   25   15       4     8    27   4842      15.35   0.333    0.123
  3      1   1   50    5      10     2    21   4848      55.90   0.833    0.323
  4      1   1   50   15       6     6    34   4835      25.00   0.500    0.150

  5      1   2   25    5       8     4    27   4842      39.10   0.667    0.229
  6      1   2   25   15       5     7    26   4843      20.94   0.420    0.161
  7      1   2   50    5       4     8    23   4846      16.05   0.333    0.148
  8      1   2   50   15       3     9    29   4840      10.15   0.333    0.094

  9      2   1   25    5       2     2     3   4874      35.35   0.667    0.500
 10      2   1   25   15       2     2     3   4874      35.35   0.667    0.500
 11      2   1   50    5       2     2     9   4868      29.65   0.667    0.500
 12      2   1   50   15       2     2    79   4798      16.90   0.667    0.053

 13      2   2   25    5       0     4    74   4803       0.19   0.000    0.000
 14      2   2   25   15       0     4   116   4761       0.30   0.000    0.000
 15      2   2   50    5       0     4    60   4817       0.15   0.000    0.000
 16      2   2   50   15       3     1   114   4763      27.77   1.000    0.036

 17      3   1   25    5       4     8    90   4779       9.98   0.333    0.042
 18      3   1   25   15       1    11     3   4866       4.53   0.083    0.250
 19      3   1   50    5       8     4   115   4754      26.36   0.667    0.065
 20      3   1   50   15       1    11    27   4842       2.14   0.083    0.036

 21      3   2   25    5       1    11    25   4844       2.22   0.083    0.038
 22      3   2   25   15       4     8   136   4733       8.15   0.333    0.028
 23      3   2   50    5       8     4   227   4642      20.16   0.667    0.034
 24      3   2   50   15       6     6   247   4622      11.68   0.500    0.024

 25      4   1   25    5       0     5     1   4875       0.00   0.000    0.000
 26      4   1   25   15       2     3     2   4874      29.08   0.400    0.500
 27      4   1   50    5       0     5     0   4876       0.00   0.000    0.000
 28      4   1   50   15       3     2     1   4875      49.65   0.600    0.750

 29      4   2   25    5       0     5    68   4808       0.18   0.000    0.000
 30      4   2   25   15       0     5   127   4749       0.34   0.000    0.000
 31      4   2   50    5       0     5    69   4807       0.18   0.000    0.000
 32      4   2   50   15       1     4    73   4803       4.45   0.200    0.014

Mean responses:                                          15.90   0.353    0.148

7.42 Latin Square Experiment Response Data

Table 7-2B gives the response data (NIS only) for this
experiment. For example, the BRS submitted by analyst 2 for question 3
resulted in a search having an NIS response of 26.36. (This is search 19
of table 7-1.)


TABLE 7-2. - RESPONSE DATA FOR LATIN SQUARE
DESIGN WITHIN METHOD 1

A. Latin Square Layout

          Q1      Q2      Q3      Q4          Treatment definitions
                                              (documents/index terms)
A1        T1      T3      T4      T2          T1 = 25/5
A2        T2      T4      T3      T1          T2 = 25/15
A3        T4      T1      T2      T3          T3 = 50/5
A4        T3      T2      T1      T4          T4 = 50/15


B. Latin Square Response Data (NIS)

          Q1       Q2       Q3       Q4

A1       11.54    29.65     2.14    29.08
A2       15.35    16.90    26.36     0.003
A3       25.00    35.35     4.53     0.000
A4       55.90    35.35     9.98    49.65

7.43 Predicted Document Utilities vs. Known Document Utilities

Half (16) of the experimental searches were performed using a
machine derived BRS ($M_2$). Recall that either the index term weights or
an equivalent BRS can be used to search the file. For the 16 $M_2$ data
points, the file was searched using the term weights. This was done as
a matter of practical convenience. (The equivalent BRS's were also
derived and will be discussed in section 7.45.)

When searching with term weights, a predicted utility $\hat{u}$ is
computed for each document in the file. Because the utility threshold
$\tau$ varies from question to question, it is convenient to compare
predicted utilities by using $(\hat{u} - \tau)$ instead of $\hat{u}$. Here
$(\hat{u} - \tau) \ge 0$ if the document is predicted to be relevant and
$(\hat{u} - \tau) < 0$ otherwise.

When weighted index term searches are performed on a file of
documents, any given document from the file ends up in one of three
categories.

(A) No index terms with assigned weights match index terms in
the given document.

(B) One or more of the index terms associated with the given
document matches index terms in the search strategy, and
$(\hat{u} - \tau) \ge 0$. (Relevance is predicted.)

(C) One or more of the index terms associated with the given
document matches index terms in the search strategy, and
$(\hat{u} - \tau) < 0$.

For the 16 weighted term searches, an average of 5.72 percent
of all documents fell into categories (B) or (C) above; 3.82 percent
had $(\hat{u} - \tau) < 0$ and 1.90 percent had $(\hat{u} - \tau) \ge 0$. For the
file of 4881 documents, this gave an average per-search yield of 93
documents with $(\hat{u} - \tau) \ge 0$ (category B) and 186 documents with
$(\hat{u} - \tau) < 0$ (category C).
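
The three categories can be illustrated with a short sketch. It is
not the original search program. The term names, weights and threshold
are hypothetical, and the predicted utility is taken to be the plain
sum of the weights of the matching terms (no constant term), which is
an assumption about the form of the LUPF.

```python
def classify(doc_terms, weights, tau):
    """Place one document into category 'A', 'B' or 'C'.

    Returns (category, predicted utility minus threshold); the second
    item is None when no weighted term matches (category A).
    """
    matched = [term for term in doc_terms if term in weights]
    if not matched:
        return "A", None
    u_hat = sum(weights[term] for term in matched)
    margin = u_hat - tau
    return ("B" if margin >= 0 else "C"), margin


# Hypothetical weights (multiples of 1/2) and threshold.
weights = {"laser": 3.5, "optics": 2.0, "plasma": -1.5}
tau = 4.0

print(classify({"laser", "optics"}, weights, tau))    # ('B', 1.5)
print(classify({"plasma", "optics"}, weights, tau))   # ('C', -3.5)
print(classify({"biology"}, weights, tau))            # ('A', None)
```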

Table 7-3A shows the relative frequencies $P(\hat{u} - \tau)$ of the
predicted utilities for all searches in categories (B) or (C) above.
Since coefficients of the LUPF are integral multiples of 1/2, so are
the values of $(\hat{u} - \tau)$. (See section 5.7.) For example, 16.45 percent
of all documents in categories (B) or (C) had predicted utilities of
-3.0 or -2.5. The distribution of $P(\hat{u} - \tau)$ tends to be bimodal,
having separate modes for the documents with $(\hat{u} - \tau) \ge 0$, and for
those with $(\hat{u} - \tau) < 0$.

Relevant documents in the file had been identified and assigned
utilities before the searches were run. It is possible to compare the
pre-assigned values of $(u - \tau)$ for these relevant documents with the
values of $(\hat{u} - \tau)$ predicted by the system.

Tables 7-3B and 7-3C compare the predicted $(\hat{u} - \tau)$ with the
assigned $(u - \tau)$ for the relevant documents only. Table 7-3B gives a
coarse cross-classification showing $(\hat{u} - \tau)$ grouped into categories
(A), (B) or (C) above. For example, 13 relevant documents with an
assigned $(u - \tau) = 1$ were placed by the 16 searches into category
(B). There were a total of 132 relevant documents associated with the
group of 16 searches.

Table 7-3C gives a more detailed breakdown of cross classification
information contained in table 7-3B. For example, three relevant
documents had a predicted utility $(\hat{u} - \tau) = 4$. All three of these
documents had an assigned or true $(u - \tau) = 0$.


TABLE 7-3. - COMPARISON OF PREDICTED DOCUMENT UTILITY
WITH ACTUAL DOCUMENT UTILITY

A. Relative Frequencies of Observed Values of $(\hat{u} - \tau)$

$\hat{u} - \tau$       $P(\hat{u} - \tau)$          $\hat{u} - \tau$       $P(\hat{u} - \tau)$

 0.0            0.0459            -8.0           0.0007
 1.0, 1.5        .0985            -7.0            .0013
 2.0             .1410            -6.0            .0018
 3.0             .0378            -5.0            .1014
 4.0             .0054            -4.0            .1072
 5.0             .0009            -3.0, -2.5      .1645
 6.0             .0020            -2.0, -1.5      .1386
 7.0             .0007            -1.0, -0.5      .1518
 8.0             .0004


B. Comparison of Preassigned Document Utilities $(u - \tau)$ With
Those Predicted by the Linear Model $(\hat{u} - \tau)$ for Relevant
Documents

                          Predicted Utilities (Coarse)

True utilities      No match    $(\hat{u} - \tau) \ge 0$    $(\hat{u} - \tau) < 0$    Total
$(u - \tau)$             (A)           (B)              (C)

     0                 23            19                6            48
     1                 22            13                9            44
     2                 15             7                6            28
     3                  4             5                3            12

   Total               64            44               24           132


C. Detailed Breakdown of Table 7-3B Above

                          Predicted Utilities $(\hat{u} - \tau)$

True utilities
$(u - \tau)$      -4   -3   -2   -1    0    1    2    3    4    5    6    (A)   Total

     0         0    0    2    4   13    2    1    0    3    0    0     23     48
     1         0    0    4    5    1    3    3    5    0    0    1     22     44
     2         1    0    3    2    2    1    2    2    0    0    0     15     28
     3         0    0    2    1    1    1    2    1    0    0    0      4     12

   Total       1    0   11   12   17    7    8    8    3    0    1     64    132

7.44 Values of $\hat{R}$ for Index Terms in the Training Sets

Table 7-4 shows the distribution of $\hat{R} = H(x) - H(x/y)$ for the
index terms which appeared with documents in the eight different
training sets used for the experiment. An average of 148 different terms
were found with each 25 document training set and an average of 250
terms were found with each 50-document training set. To illustrate the
use of table 7-4, there are two index terms with $0.15 \le \hat{R} \le 0.19999$ in
the 25 document training set (D = 25) associated with query 1.

TABLE 7-4. - DISTRIBUTION OF $\hat{R} = H(x) - H(x/y)$ FOR INDEX
TERMS APPEARING IN THE TRAINING SETS

                            D = 25                     D = 50
$\hat{R}$ (bits)            Q1    Q2    Q3    Q4        Q1    Q2    Q3    Q4

0.00 - 0.04999       110    91   121   141       195   210   280   254
 .05 -  .09999         7    28    36    23        11    15    10    13
 .10 -  .14999         8     9     6     2         2     0     6     1
 .15 -  .19999         2     0     3     2         0     0     ?     ?
 .20 -  .24999         ?     0     0     0         0     0     ?     ?
 .25 -  .29999         ?     0     0     0         0     0     ?     ?

Total terms          128   128   166   168       208   225   298   269

(? = entry not legible in the source copy.)


7.45 BRS Descriptions

The BRS is a union of index term solution families. It is
convenient to describe a BRS by using some particular attribute of the
solution families from which it is formed. One useful attribute of an
individual solution family is the number of index terms which must be
simultaneously present in a document to cause the document to match
the family and hence be retrieved. This attribute will be called the
SIZE (S) of the family and will be used to compare machine-generated
BRS's with those heuristically generated by users.

Let the SIZE S of a solution family $F_k$ be the number of fixed
variables which equal unity in the family. That is,

$$S = \sum_{j=1}^{n} F_{kj} \quad \text{for } F_{kj} \ne (-) \qquad (7\text{-}13)$$

The following simple example illustrates this definition.

Family     $(T_1 T_2 T_3 T_4)$     SIZE (S)

$F_1$         (1,0,-,1)           2
$F_2$         (0,1,-,-)           1                  (7-14)
$F_3$         (1,1,1,-)           3

Families with S = 1 are those which specify the presence of
only one matching index term in order to retrieve the document.
Families with S = 2 require a specified pair of index terms to be
present. Note that variables in the family which are fixed at zero require
the absence of the corresponding index term in order that the document
will be retrieved.
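
The definition of SIZE and the matching rule can be sketched in a few
lines. This is an illustration only; the tuple representation, with
None standing for a free variable (-), and the function names are
assumptions. The three families are those of example (7-14).

```python
def family_size(family):
    """SIZE S: the number of variables fixed at one in the family."""
    return sum(1 for value in family if value == 1)


def matches(family, document):
    """True if the document agrees with every fixed variable."""
    return all(v is None or v == d for v, d in zip(family, document))


def retrieved(brs, document):
    """A BRS is a union of families: any matching family retrieves."""
    return any(matches(family, document) for family in brs)


F1 = (1, 0, None, 1)
F2 = (0, 1, None, None)
F3 = (1, 1, 1, None)
brs = [F1, F2, F3]

print([family_size(f) for f in brs])    # [2, 1, 3]
print(retrieved(brs, (1, 0, 1, 1)))     # True, matched by F1
print(retrieved(brs, (0, 0, 1, 1)))     # False, no family matches
```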

Table 7-5 shows the distribution of solution families having a
size S within a BRS for 30 of the 32 BRS's used in the experiment.
(BRS data was lost for data points numbered 29 and 30.) For example,
consider data point 14 (question 2, $M_2$ (machine), a 25 document
training set, with a nominal 15 index terms used for the BRS). There were
12 solution families in the associated BRS. Four of these families had
S = 1, six had S = 2 and two had S = 3.

TABLE 7-5. - DISTRIBUTION OF SOLUTION
FAMILIES HAVING SIZE S

L is the actual number of index terms in the BRS.
T = 5 or 15 is the nominal number.

8.0 EXPERIMENTAL DATA ANALYSIS


8.1 Analysis of Variance for the Factorial Experiment

The response data for each of the 32 experimental searches
appears in Table 7-1. Three different measures of search effectiveness
are considered (NIS, precision and recall). The experimental data of
Table 7-1 is analyzed separately for each measure of effectiveness.
Three corresponding analysis of variance (ANOVA) tables are shown in
Table 8-1. These will be discussed below. Only effects which are
significant at an alpha level of at least 0.10 (confidence level of
90%) will be discussed.

8.11 Dependence of the NIS on Methods

The ANOVA table for this measure of effectiveness is shown in
Table 8-1A. The only factor significantly affecting the NIS is that
of search methods (M). Heuristic BRS's ($M_1$) gave better results than
did machine BRS's ($M_2$). The experiment treatment means are:

$\overline{\mathrm{NIS}}(M_1) = 21.67$
$\overline{\mathrm{NIS}}(M_2) = 10.13$

$\Delta = \overline{\mathrm{NIS}}(M_1) - \overline{\mathrm{NIS}}(M_2) = 11.54$
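
The F tests behind this conclusion amount to dividing each effect mean
square by the error mean square and comparing with the tabulated
critical value. The sketch below uses the NIS mean squares as read
from Table 8-1A and the critical value 2.96 for 1 and 21 degrees of
freedom at alpha = 0.10; the function name is illustrative.

```python
MS_ERROR = 236.17        # error mean square for the NIS, 21 d.f.
F_CRITICAL = 2.96        # tabulated F for (1, 21) d.f., alpha = 0.10

# Single-degree-of-freedom mean squares for the NIS response.
MEAN_SQUARES = {
    "T": 6.79, "D": 218.43, "TD": 8.20, "M": 1067.37,
    "TM": 0.42, "DM": 63.87, "TDM": 176.36,
}


def significant_effects(mean_squares, ms_error, f_critical):
    """Return {effect: F ratio} for effects exceeding the critical F."""
    ratios = {name: ms / ms_error for name, ms in mean_squares.items()}
    return {name: round(f, 2) for name, f in ratios.items() if f > f_critical}


print(significant_effects(MEAN_SQUARES, MS_ERROR, F_CRITICAL))   # {'M': 4.52}
```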

TABLE 8-1. - ANALYSIS OF VARIANCE TABLES FOR FACTORIAL EXPERIMENTS

Source of      Sums of      Degrees of     Mean          F       Critical F
variation      squares      freedom        squares               (alpha = 0.10)

A. Normalized Information Statistic (NIS)

T                 6.79          1             6.79       <1
D               218.43          1           218.43       <1
TD                8.20          1             8.20       <1
M              1067.37          1          1067.37      4.52        2.96
TM                0.42          1             0.42       <1
DM               63.87          1            63.87       <1
TDM             176.36          1           176.36       <1
Q              1055.93          3           351.97      1.49
ERROR          4959.60         21           236.17

TOTAL          7556.97         31


B. Recall

T              0.02832          1          0.02832       <1
D               .24746          1           .24746      3.17        2.96
TD              .00720          1           .00720       <1
M               .15318          1           .15318      1.96        2.96
TM              .03920          1           .03920       <1
DM              .00189          1           .00189       <1
TDM             .07801          1           .07801      1.002
Q               .50867          3           .16956      2.17        2.38
ERROR          1.63775         21           .07799

TOTAL          2.70169         31


C. Precision

T              0.01565          1          0.01565       <1
D               .00262          1           .00262       <1
TD              .00902          1           .00902       <1
M               .30574          1           .30574      8.48        2.96
TM              .02497          1           .02497       <1
DM              .00017          1           .00017       <1
TDM             .01374          1           .01374       <1
Q               .08101          3           .02700       <1
ERROR                          21           .03602

TOTAL          1.20959         31

Confidence limits for the true difference $\delta = [\mathrm{NIS}(M_1) -
\mathrm{NIS}(M_2)]$ between the treatment means at the $(1 - \alpha)$ confidence
level are given (105) by:

$$\Delta - t(\nu,\alpha/2)\, S_e \left( \frac{1}{r_1} + \frac{1}{r_2} \right)^{1/2} \le \delta \le \Delta + t(\nu,\alpha/2)\, S_e \left( \frac{1}{r_1} + \frac{1}{r_2} \right)^{1/2} \qquad (8\text{-}1)$$

where

$\delta$ = the true difference in treatment means;

$\Delta$ = the observed difference in treatment means;

$S_e$ = the square root of the mean square due to error;

$r_1$, $r_2$ = the number of data points used to compute the treatment
means;

$\alpha$ = the error probability; and

$t(\nu,\alpha/2)$ = the Student's t statistic with $\nu$ degrees of freedom.

For the difference in NIS mean response we have $S_e = \sqrt{236.2} =
15.4$, $\alpha = 0.10$, $r_1 = r_2 = 16$, $\nu = 21$ and $t(21, 0.05) = 1.721$.
The 90 percent confidence interval for the true NIS difference is thus:

$$2.19 \le [\mathrm{NIS}(M_1) - \mathrm{NIS}(M_2)] \le 20.89.$$

Note that although $\mathrm{NIS}(M_1)$ is estimated to be twice as large as
$\mathrm{NIS}(M_2)$ there is considerable room for improvement in $M_1$ since this
method is operating only at 21.67 percent efficiency.
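
The interval can be reproduced directly from (8-1). The sketch below
takes the tabulated t value as an input rather than computing it; the
function name is illustrative.

```python
from math import sqrt


def difference_interval(observed_diff, mean_sq_error, r1, r2, t_value):
    """Confidence limits for a difference of two treatment means (8-1)."""
    half_width = t_value * sqrt(mean_sq_error) * sqrt(1.0 / r1 + 1.0 / r2)
    return observed_diff - half_width, observed_diff + half_width


# NIS response: observed difference 11.54, error mean square 236.17,
# 16 searches per method, t(21, 0.05) = 1.721.
low, high = difference_interval(11.54, 236.17, 16, 16, 1.721)
print(round(low, 2), round(high, 2))   # 2.19 20.89
```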

8.12 Dependence of Precision on Methods

The ANOVA table for search precision is shown in Table 8-1C.
The only factor significantly affecting search precision is methods (M).
The treatment means are:

$\overline{P}(M_1) = 0.245$
$\overline{P}(M_2) = 0.050$

$\Delta = \overline{P}(M_1) - \overline{P}(M_2) = 0.195$

The difference $\Delta$ is significant at the 99 percent confidence level.
By using (8-1), a 99 percent confidence interval can be established for
the true difference in search precision:

$$0.005 \le [P(M_1) - P(M_2)] \le 0.385.$$

For this application, $t(\nu,\alpha/2) = t(21, 0.005) = 2.83$,
$S_e = \sqrt{0.036} = 0.190$ and $r_1 = r_2 = 16$.

The mean precisions given above are for individual searches.
Comparing pooled $M_1$ and $M_2$ searches provides an illustration of the
large difference in search precision. A total of 484 documents were
predicted relevant by the 16 $M_1$ searches, and 50 of these were actually
relevant. For the 16 $M_2$ searches, 1484 documents were predicted
relevant, with 43 being actually relevant.

8.13 Dependence of Recall on Training Set Size

The ANOVA table for search recall is shown in Table 8-1B.
Only the number of documents in the training set (factor D)
significantly affects search recall. The training sets with the most
documents lead to searches with better recall. The experiment treatment
means are:

$\overline{R}(D_1) = 0.264$
$\overline{R}(D_2) = 0.440$

$\Delta = \overline{R}(D_2) - \overline{R}(D_1) = 0.176$

The 90 percent confidence interval for the true difference
between treatment means is given below by (8-1) with $r_1 = r_2 = 16$,
$t(\nu,\alpha/2) = t(21, 0.05) = 1.721$ and $S_e = \sqrt{0.078} = 0.279$:

$$0.007 \le [R(D_2) - R(D_1)] \le 0.345.$$

Comparing pooled $D_1$ and $D_2$ searches further illustrates the
observed differences in search recall. A perfect retrieval system
would have found 132 relevant documents for either the 16 $D_1$ searches
or the 16 $D_2$ searches. In the experiment, the 16 $D_1$ searches found
only 36 of these, while the 16 $D_2$ searches located 57 of them.
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8. ill Lack of an Effect Due to the Kuiaber of BRS Index Terms 

! 

The number of BES index terms (factor T) did Eor have a signif- 
icant effect on either the search FIS, precision, or recallc’ This is 
somewhat unexpected, and may.be due in pai't to an unfortunate source of 
uncontrolled variation in the experiment,. 

The nominal levels of factor T were set at 5 and 15 because
these levels were approximate lower and upper limits for the number of
index terms used normally by analysts in their BRS's. Accordingly, the
machine system selected the 'best' 5 or 15 index term column vectors
for inclusion in the approximation problem. Unfortunately, these
chosen binary column vectors were not often linearly independent, and
thus the optimal basis in the linear programming problem contains
fewer than 5 or 15 index term vectors with non-zero weights. (For a
further discussion of this, refer to section 5.6.) The final number of
terms in the BRS's was correspondingly reduced to less than
T_1 = 5 or T_2 = 15. This is illustrated by the data of Table 7-5,
where the column labeled L shows the actual number of index terms
appearing in the BRS. The average 'high' level (T_2) is 10 index terms
(instead of 15), and the average 'low' level (T_1) is 3.8 instead of 5.

Experimentally, this would have the effect of 'smearing' the
level of factor T, and might mask effects of variation due to this
factor. The levels of factor T in the experiment must be considered
qualitatively as 'high' or 'low' instead of quantitatively as was
originally intended. Suggestions are offered in section 5.6 for
overcoming this difficulty in future applications by modifying the LP
program.

8.15 Summary of the Factorial Experiment

The factor M (methods) had a significant effect on both the
search precision and NIS. Furthermore, it was the only experimental
factor which had an effect on precision or the NIS. The 16 M_1 searches
(heuristic BRS's) had an average NIS response of 21.67 and an average
precision of 0.245. The 16 M_2 searches (machine BRS's) had an average
NIS of 10.13 and an average precision of 0.050. From Figure 7-1,
virtually the entire observed average difference in NIS response
between M_1 and M_2 can be attributed to the observed average
difference in precision between M_1 and M_2. This large observed
difference in average search precision between M_1 and M_2 is felt to
be related to differences in selection of index terms and structural
form of the BRS. Evidence for this will be presented in subsequent
sections.

Search recall was observed to depend significantly on the number
of documents in the training set, and to be independent of the
search method. The average search recall for the 25 document training
set (D_1) was 0.264, while the 50 document training set (D_2) led to
searches with an average recall of 0.440.

The number of index terms (nominally T_1 = 5 and T_2 = 15)
extracted from the training set and used for subsequent BRS formation
had no observed significant effect on the search recall, precision or
NIS. The levels T_1 and T_2 varied somewhat during experimentation.
This may have helped to obscure a true effect if one were actually
present.

8.2 Analysis of Variance for the Latin Square Sub-Experiment

This experiment, as discussed in section 7.37, was designed to
determine whether there are significant differences between analysts,
questions or treatments when method M_1 (heuristic BRS formation) is
considered alone. Response data for this experiment appears in Table
7-2B. The ANOVA is shown below in Table 8-2.

Conclusions are simple. There are no significant effects
attributable to either analysts, questions or treatments which are
discernible from the experiment data at the chosen 90 percent
confidence level (or even at the 75 percent confidence level).


TABLE 8-2. - ANALYSIS OF VARIANCE TABLE FOR LATIN SQUARE EXPERIMENT


Source of        Fixed or   Expected
variation        random     mean squares     df      SS        MS       F    F(0.75)

Analysts (A)        R       σ^2 + 16σ_A^2     3   1396.98   465.66   1.58    1.78

Questions (Q)       R                         3    837.90   279.30    <1

Treatments (T)      F       σ^2 + 16σ_T^2     3    394.85   131.62    <1

Error               R       σ^2               6   1766.94   294.49

Total                                        15   4396.66
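
The mean squares and F ratios of Table 8-2 can be recomputed from the
sums of squares and degrees of freedom. The following is a minimal
sketch, assuming that each effect is tested against the error mean
square (which is consistent with the tabulated F of 1.58 for analysts);
the dictionary keys and function names are illustrative.

```python
# Sums of squares and degrees of freedom as listed in Table 8-2.
rows = {
    "analysts": (1396.98, 3),
    "questions": (837.90, 3),
    "treatments": (394.85, 3),
    "error": (1766.94, 6),
}

def mean_squares(table):
    """Mean square = sum of squares / degrees of freedom."""
    return {name: ss / df for name, (ss, df) in table.items()}

def f_ratios(table, error_key="error"):
    """F ratio of each effect against the error mean square (assumed test)."""
    ms = mean_squares(table)
    return {name: ms[name] / ms[error_key]
            for name in table if name != error_key}

ms = mean_squares(rows)
f = f_ratios(rows)
for name in f:
    print(f"{name:10s} MS = {ms[name]:8.2f}  F = {f[name]:.2f}")
```

Only the analysts' ratio exceeds 1, and it remains below the tabulated
F(0.75) value of 1.78.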





8.3 Extraction of Best Index Terms


8.31 Distribution of R


Table 7-4 was discussed in section 7.44. This table shows the
relative frequencies of observed values of R = H(X) - H(X/Y) for the
eight training sets which were used to generate the experimental BRS's.
The quantity (0.693)(2NR) is asymptotically distributed as a chi-squared
variate with one degree of freedom (when R is in bits), under the null
hypothesis that R = 0. (See section 4.54.) If the alpha error is
fixed at 0.05, this null hypothesis can be rejected when R is greater
than 0.1105 for index terms in a 25 document training set (N=25), or
when R is greater than 0.0504 for the 50 document training set (N=50).
Index terms meeting the above criteria can be considered statistically
significant predictors of document relevance at the 95 percent
confidence level.
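
The rejection threshold on R follows directly from the chi-squared
statistic. The sketch below solves (0.693)(2NR) = critical value for R;
the critical value 3.841 is the standard upper 5 percent point of
chi-squared with one degree of freedom, and the function name is
illustrative. The printed values can be compared with the thresholds
quoted above.

```python
import math

CHI2_95_1DF = 3.841  # upper 5 percent point of chi-squared, 1 d.f.

def r_threshold(n_docs, chi2_crit=CHI2_95_1DF):
    """Smallest R (in bits) for which 2*N*R*ln(2) reaches the critical value."""
    return chi2_crit / (2.0 * n_docs * math.log(2.0))

for n in (25, 50):
    print(n, round(r_threshold(n), 4))
```

Since the statistic is proportional to N, doubling the training set
size halves the value of R needed for significance.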

From Table 7-4, the average number of index terms having a
statistically significant value of R at the 95 percent confidence
level is eight terms for each 25 document training set and 15 terms
for each 50 document training set. These averages are in line with
the nominal values (T_1 = 5 and T_2 = 15) chosen for the experiment
using another criterion. (See section 8.14.)

8.32 Differences in Index Term Selection between Methods

There are two major differences between index terms selected
using M_1 and M_2. These are: differences in R evaluated over the
training set; and differences in the annual frequency of index term use.

8.321 Differences in R. The individual index terms selected for the
BRS using M_2 (machine methods) are those having the highest values of
R. The average value of R for index terms extracted heuristically
(M_1) was only about half that of the average R using M_2. The
overall search effectiveness (NIS), however, is better for M_1 than
M_2. It follows that the index terms chosen by analysts are better
indicators of document relevancy over the file as a whole than are those
selected by the M_2 machine methods. This suggests the use of extra
information by analysts from outside the training set during the term
selection process.

8.322 Differences in Frequencies of Term Occurrence. The frequency of
term occurrence over the file as a whole was not a selection factor for
method M_2 (machine). The annual frequency of occurrence for the M_2
index terms has a mean of 773 and a variance of 597,100. For method
M_1, the population of index terms selected by analysts and used to
construct BRS families with S = 1 (see section 7.45) has a mean annual
frequency of occurrence of 177 and a variance of 31,300. The hypothesis
that the mean frequencies of occurrence are the same for M_1 and
M_2 index terms can be rejected at the 99.5 percent confidence level.
This implies that the analysts of M_1 are utilizing frequency of
occurrence information (which is not available from the training set)
when they choose index terms. To summarize, the analysts select
terms to use in their BRS's which have a frequency of occurrence lower
by a factor of 773/177 = 4.37 than those terms selected for the BRS's
of method M_2.

8.33 The Sampling Problem

The problem of choosing a representative training set is one of
sampling from the document file. A random sample is usually assumed
for the training sets of pattern recognition systems. However, in a
large document retrieval system, a randomly chosen sample for the
training set is infeasible for practical reasons. To illustrate,
assume 500,000 documents are in a file, and that 100 of them are
relevant. Now it would require, on the average, a random sample of 5,000
documents from this file to provide a training set which would include
one relevant document. Clearly, a sample of this size is unmanageable.
A training set with only 25 to 50 documents is considered typical.
Some reasonable percentage (near half) of all training set documents
should probably be relevant to insure reasonable retrieval results.
Thus a typical training set with 50 documents (and 25 relevant)
constitutes a highly 'enriched' sample, as opposed to a randomly chosen
training set.
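
The arithmetic of the illustration can be made explicit. In the sketch
below the "enrichment factor" is a name introduced here for
convenience (it is not a quantity defined in the text): the ratio of
the relevant fraction in the training set to the relevant fraction in
the file.

```python
def expected_sample_for_one_relevant(file_size, n_relevant):
    """Random sample size that contains one relevant document on average."""
    return file_size / n_relevant

def enrichment_factor(train_size, train_relevant, file_size, n_relevant):
    """Relevant fraction in the training set over that in the whole file."""
    return (train_relevant / train_size) / (n_relevant / file_size)

# Figures from the illustration: 500,000 documents, 100 relevant,
# and a 50 document training set with 25 relevant documents.
print(expected_sample_for_one_relevant(500_000, 100))
print(enrichment_factor(50, 25, 500_000, 100))
```

Under these figures a relevant document is several thousand times more
common in the training set than in the file, which is the sense in
which the sample is non-random.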

The results of section 8.32 indicate that the analysts of M_1
are using supplementary information to select index terms. It is
interesting to relate this observation to the phenomenon of non-random
sampling discussed above.

The data presented in sections 8.321 and 8.322 suggest that
the supplementary information is of two forms. First, the analysts'
knowledge of term occurrence frequency is used to avoid those terms
which occur frequently, even though they have a high value of R over
the enriched training set. Perhaps the analyst 'feels' (for example)
that there are only 15 relevant documents in a one-month section of
the file. This leads him to reject any index terms which he knows have
more than 50 associated documents (on the average) in a one-month
section of the file. If the training set size were greatly increased, it
is felt that the same low frequency index terms would also be selected
by method M_2.

Secondly, pure sampling error of a random nature may cause
terms to appear to be good discriminators when in fact, with a larger
training set, they would not be. These terms are excluded by the
analysts because they do not 'fit in' with the analyst's concept of the
query. Here the analysts supply information based on their prior
knowledge of the query and their prior knowledge of language use.

In conclusion it is hypothesized that the supplementary
information used by the analysts of M_1 to select index terms
compensates for the small size and non-randomness (enrichment) of the
training set. A high index term frequency of occurrence would tend to
reduce the value of R for this term in M_2 if the sample size were
increased. Also, the probability of observing unrelated index terms
with a high R decreases as N, the sample size, increases.

8.4 Analysis of the BRS

8.41 Dependence of BRS Solution Family Size on Methods

Table 7-5 shows the SIZE = S distribution of constituent families
of the BRS's for all the experimental searches (see section 7.45).
There are several striking differences between the BRS's for M_1 and
M_2 when they are compared using the SIZE (S) of their constituent
solution families. Table 8-3 presents this comparison.

(A) The M_1 analysts use (on the average) 27.2 solution families
to make up a BRS (recall that each solution family is a 'matching
template'). On the other hand, each M_2 BRS is composed of an average
of 108.2 solution families.

(B) The M_1 analysts composed their BRS's using only solution
families having S ≤ 2. Of 435 solution families, only 15 (or
3.45%) had S = 1. For the M_2 BRS's, solution families with
S ≤ 10 were observed, with S = 5 being the most likely value.
There were 124 (out of 1515) families with S = 2 (or 8.19%) and
36 with S = 1 (or 2.38%).

(C) Because the M_1 analysts used fewer solution families per BRS,
the number of M_1 solution families with S = 1 is less per BRS than
the M_2 families with S = 1 (0.94 versus 2.57). This causes the
total number of documents retrieved per BRS to be less (on the average)
for M_1 than M_2.


TABLE 8-3. - COMPARISON OF BRS SOLUTION FAMILY SIZES FOR M_1 AND M_2


                      M_1                      M_2

Solution      Average    Average       Average    Average
family        number     percent       number     percent
size          per BRS    of BRS        per BRS    of BRS

S=1             0.94       3.45          2.57       2.38

S=2            26.26      96.55          8.86       8.19

S≥3                                     96.77      89.43

Total          27.20     100.00        108.20     100.00


8.42 Effects of BRS Family Size on Retrieval System Operation


The SIZE = S of the solution families making up the BRS has an
effect on retrieval system operation. The expected number of documents
which a given family (or template) will match decreases as S
increases. With S=1, only one term in a document is required to match
the solution family. Thus, the expected number of matching documents
in a file covering a given time span is simply the total number of
documents indexed with the term in that time span. When S=2, all
matching documents are required to have a pair of matching terms. One
would expect (on the average) fewer documents to match a family with
S=2 than with S=1.

The following approximate model is useful for descriptive
purposes. Let p << 1 be the average probability that any given index
term will be used to index a document. Then q = 1 - p is the
probability that a given term will not be used to index a given
document. This assumes all terms are independent.

Consider a solution family F which has S variables fixed
at 1, U variables fixed at 0 and the rest arbitrary. Then, the
probability of matching the given term combination in the family with a
combination of terms in a document is p(F) = p^S q^U ≈ p^S, since
q = 1 - p ≈ 1. For a file with N documents, there will be (on the
average) M = N p(F) = N p^S documents matching the solution family F.
Now, by using log p ≈ -1/p (since p ≈ 0) in the expression
log M = log N + S log p we have:

M = N e^(-S/p) .                                            (8-2)

To a rough approximation then, the number of documents matching (or
retrieved by) a given BRS solution family decreases exponentially as
the SIZE=S of the family increases.
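
The dependence on S can be illustrated with the relation
M = N p(F) = N p^S taken directly from the model. In the sketch below
the values of N and p are assumed for illustration and are not taken
from the experiment; the function name is likewise illustrative.

```python
import math

def expected_matches(n_docs, p, size):
    """Expected documents matching a family with `size` terms fixed at 1,
    assuming independent terms and q = 1 - p close to 1."""
    return n_docs * p ** size

# Illustrative (assumed) values: N and p are not experimental figures.
N, p = 10_000, 0.01
for s in range(1, 5):
    m = expected_matches(N, p, s)
    # log M = log N + S log p is linear in S, so M falls geometrically.
    print(s, m, math.log(m))
```

Each unit increase in S multiplies the expected number of matching
documents by p, so families with S = 1 dominate the retrieved set.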

By minimizing the use of solution families with S=1, the
analysts of M_1 have cut down drastically on the number of documents
which will be retrieved by the BRS. This should increase the
search precision. By avoiding the use of families with S ≥ 3 they
have cut down the search costs by neglecting those documents which have
a very low probability of matching the BRS.

8.5 Predicted Utilities of Relevant Documents for M_2

8.51 Factors Affecting the Recall of the System

Table 7-3B (discussed in section 7.43) shows that for the
known relevant documents (with (u - τ) ≥ 0), 33.4 percent were
correctly predicted to be relevant by the system (had (u - t) ≥ 0), 18.1
percent were incorrectly predicted to be non-relevant (had (u - t) < 0),
and 48.5 percent were missed because they had no index terms in common
with index terms in the BRS. This data shows how the recall of the
system is affected by errors, since only the relevant documents are
analyzed.

The system made errors with 66.6 percent of the relevant
documents. Of these, 18.1/66.6 = 27.2 percent were misclassified by the
LUPF and 48.5/66.6 = 72.8 percent were eliminated by the feature
extraction process. This indicates that the feature extraction process
very critically affects the system recall. Improvements in
recall are most likely to be brought about by efforts to improve the
feature extraction process instead of the LUPF estimation process.
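
The breakdown of errors can be reproduced from the three percentages
quoted from Table 7-3B. The variable names in this short sketch are
illustrative.

```python
# Percentages of known relevant documents, as quoted from Table 7-3B.
correct, misclassified, no_common_terms = 33.4, 18.1, 48.5

errors = misclassified + no_common_terms
share_lupf = 100.0 * misclassified / errors        # errors due to the LUPF
share_features = 100.0 * no_common_terms / errors  # errors due to feature extraction
print(round(errors, 1), round(share_lupf, 1), round(share_features, 1))
```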

8.52 Effects of Increasing the Vocabulary Size

Although not directly supported by data here, the vocabulary
size (or number of index terms in the system master list) would seem
to have an effect on the number of documents having no terms in common
with the BRS. Some conjectures are made below.

As index terms are added to the master list, all relevant
documents associated with a given query would show (on the average) less
overlap in their index term sets. This implies that the relevant
document index terms would also have less overlap with a 'best' BRS of
given size. (It is assumed that indexing remains at a constant quality
level, that the same number of index terms are assigned to a document
before and after the master list is expanded, and that the method of
BRS formation remains the same.) The reason for this is simply that
there would be more terms for an indexer to choose from and hence the
average frequency of individual term use would be reduced, assuming a
constant file size.

As the term master list is reduced in size, term 'overlap' in
the set of relevant documents should become greater. This would cause
fewer relevant documents to be missed but more unrelated documents to
be retrieved by a 'best' BRS of fixed size. This is because the terms
and term combinations would be less specific with a reduced vocabulary.
Stated another way, decreasing the vocabulary size should increase
recall and decrease precision.

8.6 Summary of the Data Analysis 

Many aspects of the experimental data have been analysed in 
this chapter. Only the results which are felt to be most important are 
reviewed here. 

From section 8.1 it is concluded that search effectiveness (in
terms of the NIS) is significantly greater for method M_1 (analysts)
than for method M_2 (machine). It is shown that this difference can
be attributed wholly to the significant differences in search precision
between M_1 and M_2. In other words, M_1 and M_2 recover nearly the
same fraction of relevant documents (recall is the same), but method
M_2 retrieves many more non-relevant documents (a lower search
precision).

Section 8.3 shows that index terms selected by analysts differ
significantly from those selected by machine methods. The major
difference is that the M_2 terms have a much higher frequency of
occurrence. This is undesirable, since it causes more documents to be
retrieved, which reduces search precision. By using supplementary
information about term occurrence, the analysts are apparently able to
eliminate index terms which would have been eliminated had the training
sets been randomly chosen, and hence been much larger.

Section 8.4 demonstrates that the index terms chosen by the
analysts are combined in a much different manner (to form a BRS) than
are the M_2 terms. In particular, a greater number of solution
families appear in the M_2 BRS's. Also, the M_2 BRS's are constructed
largely of solution families with S ≥ 3, while for M_1 nearly all
families have S = 2. Families with S = 1 appear an average of 2.57
times per BRS with M_2, and only 0.94 times per BRS with M_1.

The selection of terms with a low frequency of occurrence,
together with the avoidance of solution families with S = 1, constitute
the major differences between M_1 and M_2. These two differences
working jointly would account for large differences in search precision
between M_1 and M_2. It appears that any attempt to make the machine
method comparable with M_1 will have to resolve these differences.

Section 8.5 analyzes errors which reduced the M_2 search recall.
About 73 percent of the errors made with relevant documents arose
because the documents had no index terms in common with the BRS. This
indicates again that improvements in the term selection process would
have a major effect on search effectiveness.

8.7 Conclusions

The results of the experimentation illustrate the basic
applicability of pattern recognition techniques to the document
retrieval problem.

Test results conclusively show the superiority of the analysts
to the machine recognition system developed here. The clear
superiority of humans to machine systems for recognition of visual
patterns is well known. It is one of the reasons for the enduring
academic interest in pattern recognition processes. Thus it is not
surprising that patterns consisting of index terms should be recognized
more efficiently by humans than by machine methods.

What is surprising and encouraging is that the resolution of
the current differences in system effectiveness does not appear to be
out of the realm of possibility. The current best estimated difference
of 11.5 percent in the NIS can possibly be resolved by extending and
refining the model. In particular, two refinements are felt to be most
promising.

First, the methods of index term selection should be extended
to incorporate term frequency of occurrence information. This would
tend to compensate for the non-randomness of the training or sample
set.

Secondly, restrictions should be placed on the BRS to reduce
the number of solution families with S = 1 and S ≥ 3.

The above refinements are discussed in chapter 9. They both
should improve the search precision of M_2 relative to M_1 and make
the differences in overall effectiveness less for the two methods. A
number of other reasonable extensions to the present system are
also mentioned in chapter 9.

9.0 SUGGESTIONS FOR FURTHER RESEARCH

9.1 General

Several suggestions for further research can be made as a
result of this study. These can be more or less divided into five
distinct areas, which are summarized very briefly below before details
are given.

(A) The information statistic for selecting index terms can be
modified to take term frequency of occurrence into account.

(B) Instead of selecting the best single index terms, term
pairs or triplets, etc., can be selected which have a high information
content over the training set. This is a form of higher order feature
extraction.

(C) The approximation theory model can be altered. Possible
modifications include a change of norm from L_1 to L_∞ or L_2; use of
{0,1,2} variables for x_ij based on 'major' or 'minor' terms in the
training set; use of rougher utility estimates (say +1 or -1) for
documents in the training set; and secondary selection of alternate
optimal solutions based on frequency of occurrence of index terms. Also,
alternate algorithms can be investigated for more efficient solution of
the approximation problem.

(D) The solutions of the LPBI can be constrained so that only
solution families with S ≤ 2 or S = 2 will be found. This is
easily done by solving a two-inequality system instead of a single
inequality.

(E) The important effect of iterative improvement of the
training set by repeated searches of the file can be considered as an
extension of the previous test methods.

9.2 Modifications to the Information Statistic
for Selecting Index Terms

9.21 Incorporating Information about Frequency
of Term Occurrence

A revised measure of goodness for index term selection which
utilizes index term frequency of occurrence information is desired.
One such measure would be (R_j / f_j), which would replace (R_j). Here
f_j is the expected frequency of occurrence of term j over the section
of file to be searched. This measure would reduce the estimated
effectiveness R_j of the individual term if it occurred very frequently.
For example, the term 'computer program' might be judged excellent
based on the training set value of R, but knowing that it occurred
1000 times per year might change this judgment. This would be
especially true if a prior user estimate were available to the effect
that no more than 50 documents were relevant in the annual file.
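
The effect of the revised measure on term ranking can be sketched as
follows. The values of R and f below are hypothetical and chosen only
to illustrate the reordering; apart from the annual frequency of 1000
for 'computer program' quoted above, none are taken from the
experimental data. The function name is illustrative.

```python
def rank_terms(stats, use_frequency=True):
    """Rank terms by R/f (or by R alone) in descending order.

    stats maps term -> (R, f): R is the training-set information
    statistic, f the expected frequency of occurrence in the file section.
    """
    def score(item):
        _, (r, f) = item
        return r / f if use_frequency else r
    return [term for term, _ in sorted(stats.items(), key=score, reverse=True)]

# Hypothetical values for illustration only.
stats = {
    "computer program": (0.30, 1000.0),
    "pattern recognition": (0.20, 40.0),
    "linear programming": (0.15, 60.0),
}
print(rank_terms(stats, use_frequency=False))
print(rank_terms(stats, use_frequency=True))
```

With these assumed figures the frequently occurring term ranks first
on R alone but last on R/f, which is the behavior the revised measure
is meant to produce.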

9.22 Utilizing More Refined Document Utility Measurements

It is also possible to derive a more refined R without using
information about frequency of term occurrence. The present scheme
assumes a binary utility measure (relevant or not relevant), and
derives the information statistic from the 2x2 contingency table shown
below. The entries in the table are obtained from the training set.


                         Term present   Term absent

Relevant (u ≥ τ)             n_11           n_12         n_1.

Not relevant (u < τ)         n_21           n_22         n_2.

                             n_.1           n_.2         N


(9-1)


Since more refined utility measures are available, a more
extensive table could be set up as shown below:


             Term present   Term absent

u = 1            n_11           n_12         n_1.

u = 2            n_21           n_22         n_2.

  .               .              .            .
  .               .              .            .

u = 9            n_91           n_92         n_9.

                 n_.1           n_.2         N


(9-2)


Table (9-2) can be used instead of (9-1) to determine R =
H(X) - H(X/Y) by direct calculation.

9.3 Applying the Feature Selection Process
to Different Types of Features

9.31 Higher Order Features


Either single index terms or term combinations can be considered
as pattern 'features'. The system tested extracted the best
single-term features. It is possible to consider other types of index
term 'features'. For example, all training set index terms can be
arranged in pairs (T_i, T_j), triplets (T_i, T_j, T_k), etc., having
fixed configurations. Any one of these fixed configurations can be
considered as a binary 'feature' and an information statistic R can be
derived for it.
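
The information statistic for any binary feature, whether a single
term or a fixed configuration of several terms, can be computed
directly from a contingency table of feature value against document
utility. The sketch below is one way of doing this, assuming that X
denotes the utility class and Y the feature value (the statistic is
symmetric in X and Y, so this choice does not affect the result). The
counts in the example are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_statistic(table):
    """R = H(X) - H(X/Y) for a table with one row per utility class X
    and one column per feature value Y (present, absent)."""
    n = sum(sum(row) for row in table)
    h_x = entropy([sum(row) for row in table])
    h_x_given_y = 0.0
    for col in zip(*table):
        col_total = sum(col)
        if col_total > 0:
            # Weight each column's conditional entropy by its probability.
            h_x_given_y += col_total / n * entropy(list(col))
    return h_x - h_x_given_y

# Hypothetical 2x2 counts: rows = relevant / not relevant,
# columns = feature present / feature absent.
print(information_statistic([[10, 15], [2, 23]]))
```

The same function accepts a table with nine utility rows, so it
applies equally to a binary utility measure and to a more refined one.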


For an example of two-term features, consider the term pair
(T_i, T_j). There are four fixed configurations in which to arrange
this pair of terms (a bar denoting an absent term), i.e.


(T_i ∩ T_j) = (T_i T_j)

(T̄_i ∩ T_j) = (T̄_i T_j)

(T_i ∩ T̄_j) = (T_i T̄_j)

(T̄_i ∩ T̄_j) = (T̄_i T̄_j)


Since the same information is contained in two of these
configurations, there are only three different fixed configurations to
consider.

For a training set with 200 different terms, there would be
3 C(200,2) = 3(19,900) = 59,700 term-pair features to consider
individually. Each of these features would require a corresponding R
computation.
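
The count of term-pair features follows from the number of unordered
pairs. A minimal check, with an illustrative function name:

```python
import math

def pair_feature_count(n_terms, configs_per_pair=3):
    """Number of two-term features: configurations times unordered pairs."""
    return configs_per_pair * math.comb(n_terms, 2)

print(math.comb(200, 2), pair_feature_count(200))  # 19900 59700
```

The count grows roughly as the square of the vocabulary size, which is
why complete enumeration becomes expensive for larger training sets.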

Methods to avoid complete enumeration when searching for good
term-pair features have been discussed by Swonger. If the
'features' extracted are of the multiple index term type, the LUPF will
be of the form


Σ_j a_j f_j = y_i ,  where f_j are features such as (T_1 T_3).


When this LUPF is thresholded, the resulting pseudo-Boolean inequality
is no longer linear. Luckily, solving a non-linear pseudo-Boolean
inequality can be accomplished as a simple extension of the linear
theory. This will be discussed in section 9.42.

9.32 Selection of Features for Training Set Coverage

The results of section 8.51 showed that 48.5 percent of the
relevant documents were missed because they had no terms in common with
those in the set of selected index terms. This suggests that perhaps
single-term features or term-pair features be chosen not only for their
good discrimination qualities, but also for their degree of 'coverage'
of the training set. One way of insuring better coverage is to choose
features with high information statistics, but with low pairwise
correlation coefficients. This type of correlation screening has been
studied by Maltz (107) for binary features extracted from
two-dimensional patterns.

9.33 Major and Minor Index Terms in the NASA System

All index terms occurring in the NASA system are assigned as
either 'major' or 'minor' terms. Major terms are intended to indicate
major concepts in the document, while minor terms are used in a
supporting role. Selecting only from the set of major terms would be one
way of utilizing this built-in form of feature extraction.


9.4 Modifications to the BRS Structure

9.41 Avoiding Solution Families with S = 1


By changing the structure of the BRS to avoid solution families
with S = 1, the precision of the search may be increased. One way of
doing this is to incorporate constraints directly on the binary
variables of the LUPF. For example, to restrict the SIZE of all solution
families to be less than or equal to 2, we can solve the system given by


Σ_j a_j T_j ≥ (t - a_0)

Σ_j T_j ≤ 2


Another, more indirect way of restricting the use of frequently
occurring index terms would be to solve a system such as the following:


Σ_j a_j T_j ≥ (t - a_0)

Σ_j f_j T_j ≤ U


where U is the maximum (expected) number of documents desired per
time period and the f_j are expected frequencies of term occurrence
for the same time period.


Methods for solving systems of linear pseudo-Boolean
inequalities are discussed by Hammer and Rudeanu.
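
For a small number of terms, the constrained systems above can be
explored by direct enumeration. The sketch below is a brute-force
check and not the method of Hammer and Rudeanu; it lists individual
binary solutions rather than solution families. The weights, threshold,
and frequencies are hypothetical, and the function and parameter names
are illustrative.

```python
from itertools import product

def solve_system(a, a0, t, f=None, max_terms=None, max_docs=None):
    """Enumerate binary vectors T satisfying
        sum_j a[j]*T[j] >= t - a0
    and, optionally, sum_j T[j] <= max_terms and
    sum_j f[j]*T[j] <= max_docs (f is required with max_docs)."""
    if max_docs is not None and f is None:
        raise ValueError("term frequencies f are needed with max_docs")
    solutions = []
    for T in product((0, 1), repeat=len(a)):
        if sum(aj * tj for aj, tj in zip(a, T)) < t - a0:
            continue  # LUPF threshold not met
        if max_terms is not None and sum(T) > max_terms:
            continue  # too many terms fixed at 1
        if max_docs is not None and sum(fj * tj for fj, tj in zip(f, T)) > max_docs:
            continue  # expected document count too large
        solutions.append(T)
    return solutions

# Hypothetical weights, threshold, and term frequencies.
a, a0, t = [4.0, 3.0, 2.0, 1.0], 1.0, 6.0
f = [500, 40, 60, 20]
print(solve_system(a, a0, t, max_terms=2))
print(solve_system(a, a0, t, f=f, max_docs=150))
```

In this example the frequency constraint excludes every solution that
uses the first (most frequent) term, even though that term carries the
largest weight.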


9.42 Solving the Nonlinear Pseudo-Boolean Inequality 


As mentioned in section 9.31, choice of other than single-term
features leads to a pseudo-Boolean inequality which has the form


Σ_j a_j f_j ≥ (t - a_0)




As an example, consider




This nonlinear inequality may be solved by using simple extensions of
the methods used for linear inequalities in chapter 6. See, for
instance, Hammer and Rudeanu. To solve the nonlinear inequality,
define new binary variables y_j, one for each feature f_j.



Then solve the linear inequality given by


Σ_j a_j y_j ≥ (t - a_0) .


After the m solution families F_k(y) are obtained
for this linear inequality, the original variables are substituted
into the expressions for the linear families F_k(y) as follows:

F_k(y) → F_k(T) .

Finally, after simplifying the resulting expressions for F_k(T), we
have the desired solution families for the nonlinear inequality. Thus
the specification of multi-term features does not introduce severe
computational difficulties.
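
The substitution procedure can be sketched by enumeration for a small
case. The code below solves the linear inequality in the new variables
y by brute force (again, not the Hammer and Rudeanu algorithm) and then
keeps those term vectors T whose feature values form one of the linear
solutions. The features and weights are hypothetical, and the function
names are illustrative.

```python
from itertools import product

def solve_linear(a, rhs):
    """All binary y with sum_j a[j]*y[j] >= rhs (brute force)."""
    return [y for y in product((0, 1), repeat=len(a))
            if sum(aj * yj for aj, yj in zip(a, y)) >= rhs]

def solve_nonlinear(a, features, n_terms, rhs):
    """Binary T for which the features f_j(T) = product of T[i], i in
    features[j], satisfy sum_j a[j]*f_j(T) >= rhs."""
    linear = set(solve_linear(a, rhs))
    result = []
    for T in product((0, 1), repeat=n_terms):
        # Substitute the original variables: y_j = product of its terms.
        y = tuple(int(all(T[i] for i in idx)) for idx in features)
        if y in linear:
            result.append(T)
    return result

# Hypothetical example: features T0, T1*T2, T0*T2 with weights 3, 2, 1.
features = [(0,), (1, 2), (0, 2)]
print(solve_nonlinear([3.0, 2.0, 1.0], features, n_terms=3, rhs=4.0))
```

Not every solution of the linear inequality corresponds to a term
vector, because the y_j are not independent; mapping each T to its y
values discards the inconsistent combinations automatically.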


9.5 Derivation of the LUPF

Several modifications and extensions are discussed below, all
of which retain the linear model for predicting document utility.

9.51 Choice of Norm

Parameters in the LUPF could be estimated from the training set
by using the minimal value of the L_∞ or L_2 norm as a measure of
goodness of fit instead of the minimal L_1 norm. The L_∞ problem
also has a formulation as a linear programming problem.

9.52 Selection Among Alternate Optimal Solutions

Both L_1 and L_∞ problems suffer from the 'disadvantage' of
admitting alternate optimal solutions. This could be used to advantage
by selecting among alternate optimal solutions as a post-optimal
procedure. A secondary function based on frequency of term occurrence
could be used for this purpose.

9.53 Choice of Independent Variables


The choice of independent variables x_ij was very simple for
the problem tested. Here x_ij ∈ {0,1} depending on whether or not a
feature (term) j is present with document i. A simple extension is
to let x_ij ∈ {0,1,2}, where now x_ij = 1 if term j is a minor term
with document i and x_ij = 2 if term j is a major term. (See
section 9.33.)

When the LUPF (formed using x_ij ∈ {0,1,2}) is thresholded, it no
longer gives a pseudo-Boolean inequality. This difficulty can be
overcome by converting the integer inequality to an equivalent system of
pseudo-Boolean inequalities. See, for instance, Hammer and
Rudeanu (112).

9.54 Choice of Dependent Variables

The dependent variable y_i is document utility. In the test
configuration y_i ∈ {1,2,...,9}. A much simpler form, and one which
might work just as well, would be to let y_i ∈ {-1,+1} as a measure of
relevance for documents in the training set. Then a value of t = 0
could be used to form the Boolean inequality.

9.55 BP Problems with Unequal Slack Costs 

With the approximation problem formulated as a linear pro- 

gramming problem, the initial basis is composed entirely of slack 
vectors. As these slack vectors are driven out of the basis the L.| 
■norm is minimized; When each slack vector has unit weight (or cost) 

I 

in the objective function, there is no preference given to one slack 
vector over another. Each has an ec^ual opportunity -to be driven from 
the basis. Every slack vector is associated with one row of the con- 
straint matrix, which represents a single document in the training set. 
When a slack vector is driven out of the basis , the residual for this 
row drops to zero and a perfect fit to the predicted document utility 
is realized. 

By assigning different objective function weights to slack vectors, it is possible to force a better fit to the part of the training set with the higher weights, at the expense of the part of the training set with the lower weights. This can be used in at least two ways.

9.551 Forced Fitting to the Relevant Documents. By assigning higher weights to slack vectors associated with the training set documents which are relevant, and lower weights to those documents which are non-relevant, the utilities of the relevant documents will be fit at the expense of the non-relevant ones. This may result in improved search quality.

9.552 Application to Iterative Retrieval. With iterative retrieval the training set grows in size following repeated retrieval efforts on the same file. Consider an exponential decrease in the weights of slack vectors corresponding to sample documents according to the time which they have remained in the training set (i.e., $w_n = e^{-n}$ for the n-th time in the training set). The relative importance of training set documents decreases as they become 'older'. Thus, the older documents are gradually 'forgotten', and the LUPF derived is more closely tuned to the most recently acquired members of the training set. This is one way to effectively limit the size of a large training set, and also to follow the changing interests of a user.
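The two weighting schemes of sections 9.551 and 9.552 can be sketched as slack cost vectors. This is a minimal sketch under stated assumptions: the function names, the particular cost values 2.0 and 1.0, and the `rate` parameter of the exponential decay are all illustrative choices, not values from the source.

```python
import math

def relevance_costs(utilities, tau, relevant_cost=2.0, other_cost=1.0):
    """Slack cost per document: heavier for relevant documents (utility >= tau)."""
    return [relevant_cost if u >= tau else other_cost for u in utilities]

def age_costs(ages, rate=1.0):
    """Exponentially decaying slack cost for a document that has been in the
    training set n times; rate=1.0 is an assumed decay constant."""
    return [math.exp(-rate * n) for n in ages]

def weighted_l1(residuals, costs):
    """Objective with unequal slack costs: sum of cost_i * |residual_i|."""
    return sum(c * abs(r) for r, c in zip(residuals, costs))
```

With all costs equal to one, `weighted_l1` reduces to the ordinary sum of absolute residuals, so the unit-cost case is recovered as a special case.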

9.56 Improved Algorithms

While only marginally related to the document retrieval problem, more efficient methods of solving the approximation problem are suggested by the nature of the basis inverses arising from the LP problem. In particular, it has been observed that elements of the basis inverses are integral multiples of integral powers of 1/2 when the document utilities are specified as positive integers. The LP solution variables have been observed to be integral multiples of 1/2.

9.6 Experimental Investigation of Iterative Retrieval

The ability of a document retrieval system to adapt to changing user needs has become especially important with the advent of time-sharing search systems which allow rapid implementation of successive BRS's.

The system tested in this dissertation has been of the 'static', single search type. In an iterative configuration the same file would be searched repeatedly, with modifications being made to the training set after each search. Following a sequence of searches, it is hypothesized that an asymptotic level of search effectiveness would be reached, which would be significantly greater than that of a 'single search' system.

Test methods for use with an iterative configuration could be the same as those employed for the testing here, except for two complications. First, rules regarding additions and deletions to the training set would have to be established. Perhaps the size of the training set would be limited, with new additions forcing an equal number of deletions. Alternately, the training set size could be unrestricted, and the 'older' documents 'forgotten' as outlined in section 9.552. Secondly, a stopping rule would have to be imposed to restrict the number of iterations. This could be simply a limit on the allowable number of searches. The effectiveness of the final search could become the dependent variable, instead of the effectiveness of the only search as was done here.

APPENDIX A - AN EXAMPLE PROBLEM

To provide an overview of system operation, the solution of a representative problem is presented here. A training set of pattern vectors (representing documents having user assigned utilities) is processed. First, index terms are selected in a feature extraction operation. This is followed by solving an approximation problem for document utility as a function of index term 'weights'. Finally, the LUPF is thresholded to give an LPBI. This is solved for solution families (index term matching templates). The union of these templates is a BRS. Results are illustrated with actual computer output. The system has been programmed in Fortran IV for the IBM 7094/7044 Direct Couple System.


A.1 Input Data

The input data to process a 28 document training set is shown on Figs. A-1 to A-4. The first card read in (not shown) gives the number of documents in the training set (28) and the utility threshold ($\tau = 3$) which defines relevancy on the scale of 1-9 (integer) used to rate all documents in the training set. A document is considered relevant if its utility is greater than or equal to 3 and not relevant otherwise.

For each document in the training set, the following items are read in:

(a) document number (treated as an alphanumeric character string);
(b) number of index terms;
(c) user assigned utility; and
(d) actual index terms (also treated as alphanumeric character strings).

The training set documents are processed in sequential order. Each document number is read and stored as a character string and assigned a new number (an integer) which is used by the program for further processing. Fig. A-5 shows the document data summary.

A.2 Processing of Index Terms

Figures A-6 to A-8 show an alphabetical listing of all index terms occurring in the training set and their associated information statistics (see chapter 4). Each index term is read in and stored as a character string but for all further processing is represented by an internal index term number (an integer). A total of 155 index terms were found with the 28 documents of the training set.

Figures A-9 to A-11 show the same list of index terms sorted on their information statistics instead of alphabetically. (The larger the information statistic, the more effective the index term is at discriminating between relevant and nonrelevant documents.)

A.3 The Document-Term Matrix and Computation of R

Figures A-12 ff. show the document-term matrix, which it will be convenient to denote as $T = (t_{ij})$. Each row corresponds to an index term and each column represents a document in the training set. If index term i appears in document j then $t_{ij} = 1$; otherwise $t_{ij} = 0$. At the top of Fig. A-12 the document utilities are shown over the document category designation (1 for a relevant document, 0 otherwise). This category vector is formed by applying the utility threshold $\tau = 3$ to the document utilities.

To compute the information statistics, the 0/1 row vector in T for each index term is compared with the 0/1 category vector in a 2x2 contingency table. The information statistic R is a measure of the similarity of the two vectors.
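The comparison of a term row with the category vector can be sketched as follows. The statistic itself is defined in chapter 4, which is not reproduced here, so the formula below is an assumption: it takes R to be the mutual information, in bits, between term presence and document category. Under that assumption the sketch gives a source entropy of about 0.940 for 10 relevant documents out of 28, in agreement with the SOURCE ENTROPY line of the term listings. Function names are hypothetical.

```python
import math

def entropy(p):
    """Binary entropy in bits; 0 by convention at p = 0 or p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_statistic(term_row, category):
    """Mutual information (bits) between a 0/1 term row and the 0/1 category vector."""
    n = len(category)
    # 2x2 contingency table: counts[term_value][category_value]
    counts = [[0, 0], [0, 0]]
    for t, c in zip(term_row, category):
        counts[t][c] += 1
    source_entropy = entropy(sum(category) / n)
    conditional = 0.0
    for t in (0, 1):
        row_total = counts[t][0] + counts[t][1]
        if row_total:
            # Entropy of the category within this row, weighted by row frequency.
            conditional += (row_total / n) * entropy(counts[t][1] / row_total)
    return source_entropy - conditional
```

For a term that occurs in exactly one document, and that document is relevant, the sketch returns roughly 0.0548; for a term occurring in exactly one non-relevant document it returns roughly 0.0233. Both values recur many times in the listings of Figs. A-6 to A-9, which supports the assumed formula without proving it.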


A.4 Solving the $L_1$ Norm Approximation Problem

Index term weights are determined by solving a linear approximation problem using the $L_1$ norm as the criterion of goodness. This approximation problem is set up as a linear programming problem and solved using the simplex algorithm (see chapter 5). Prior to solving the problem, all index terms are discarded except those ten having the highest information statistics. Only these ten terms appear in the approximation problem. They represent extracted features and are used to best approximate assigned document utilities as a linear combination of term weights. The linear programming problem has the following form:

    minimize   $z = c'x$
    subject to $Ax = b$
    and        $x \ge 0$.

Figures A-15 to A-17 show the matrix A and the vectors b and c which result from setting up the approximation problem using only the ten best terms. There are 28 rows in the matrix A and 78 columns. Data is listed by columns (a(13,6), for example, is -1.00). Cost data ($c_j$) are listed with each matrix column. All costs are either 0 (non-slack cols. 1-22) or 1 (slack cols. 23-78). The right hand side (b) is shown in Fig. A-17.

The elements of the right hand side vector $b = (b_i)$ are the utilities assigned to the documents. The first eleven columns of the matrix $A = (a_{ij})$ correspond to a constant (first column) plus the 0/1 vectors from the document-term matrix corresponding to the ten index terms with the largest information statistics.

Figure A-18 shows a solution summary printed after the linear programming problem was solved. This figure relates the basic variable numbers (structural columns in the optimal basis) to the actual index terms and the slack variables.
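The dimensions quoted above (28 rows, 22 non-slack columns, 56 slack columns) are consistent with the usual way of posing a sum-of-absolute-residuals fit as a linear program: each signed weight and each signed residual is split into two non-negative parts. The sketch below assembles such a problem. The column order, and the assumption that columns 12-22 are the negatives of columns 1-11, are guesses made for illustration; the function name is hypothetical and no solver is included.

```python
def build_l1_lp(term_columns, utilities):
    """Assemble A, b, c for: minimize c'x subject to Ax = b, x >= 0.

    term_columns: list of k 0/1 lists, one per selected term, each of length m.
    utilities:    list of m assigned document utilities (the right-hand side b).
    Assumed column order: [1, t_1..t_k], their negatives, then one
    (+1, -1) slack pair per document.
    """
    m, k = len(utilities), len(term_columns)
    # Zero cost on the structural columns, unit cost on every slack column.
    c = [0.0] * (2 * (k + 1)) + [1.0] * (2 * m)
    A = []
    for i in range(m):
        structural = [1.0] + [float(col[i]) for col in term_columns]
        row = structural + [-v for v in structural]
        for r in range(m):                      # slack pair for document r
            row += [1.0, -1.0] if r == i else [0.0, 0.0]
        A.append(row)
    return A, [float(u) for u in utilities], c
```

With 28 documents and ten terms this layout has 28 rows and 78 columns, of which the first 22 carry zero cost, matching the counts given in the text.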

The value of the objective function is the length of the residual vector in the $L_1$ sense (that length is 7 in the problem solved here).

Based on data shown here, the best LUPF is:

$$u = a_0 + \sum_{j=1}^{7} a_j T_j = 1.0 + 4.0T_1 - 4.5T_2 + 4.0T_3 + 2.0T_4 + 5.5T_5 + 2.5T_6 + 3.0T_7 \qquad \text{(A-1)}$$

where u is the predicted utility, $a_0 = 1.0$ is the constant term weight and $a_j$ are the weights for index terms 1 to 7. Although the approximation problem was set up to determine weights of ten terms, only seven terms have non-zero weight in the optimal solution. This phenomenon is discussed in chapter 5. It occurs because of linearly dependent index term columns in the original structural matrix. Fig. A-19 shows a computation of residuals using the derived utility prediction equation. A comparison can easily be made between the user assigned document utilities and the utilities predicted by the linear model. For example, document ten has an assigned utility of four and a predicted utility of three.
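Evaluating the fitted function and a residual is a one-line computation. The sketch below uses the constant and the seven term weights of equation (A-1); the function names and the sample term vectors are hypothetical and do not correspond to particular training documents.

```python
# Constant term and term weights a_1..a_7 from equation (A-1).
A0 = 1.0
WEIGHTS = [4.0, -4.5, 4.0, 2.0, 5.5, 2.5, 3.0]

def predicted_utility(terms):
    """Evaluate the LUPF for a 0/1 vector (T_1, ..., T_7)."""
    return A0 + sum(a * t for a, t in zip(WEIGHTS, terms))

def residual(assigned, terms):
    """Assigned utility minus predicted utility for one document."""
    return assigned - predicted_utility(terms)
```

A document containing none of the seven terms is predicted at the constant 1.0, and one containing only term 5 at 6.5.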


A.5 Solving the LPBI

The LUPF derived previously can now be thresholded to give an LPBI (see chapter 6). Using the threshold $\tau = 3$ read in with the data, we get

$$4.0T_1 - 4.5T_2 + 4.0T_3 + 2.0T_4 + 5.5T_5 + 2.5T_6 + 3.0T_7 \ge 2.0 \qquad \text{(A-2)}$$

Before this LPBI can be solved, it is necessary to convert all coefficients to integers. Multiplying the inequality by 10 gives

$$40T_1 - 45T_2 + 40T_3 + 20T_4 + 55T_5 + 25T_6 + 30T_7 \ge 20. \qquad \text{(A-3)}$$

These data are summarized in Fig. A-20.

(The notation used in the program here to describe the parameters of the LPBI (A-3) on Fig. A-20 is slightly different than that used in chapter 6. The exponents $\alpha_j$ given in (6-2) are referred to as COMPLEMENT(j) in the program here. Also, when $\alpha_j = 1$, COMPLEMENT(j) = 0.)
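The thresholding and scaling steps can be checked mechanically. The sketch below, with a hypothetical function name, moves the constant term to the right-hand side and multiplies through by a scale factor, using exact rational arithmetic so that a poorly chosen factor is detected instead of silently rounded.

```python
from fractions import Fraction

def integer_lpbi(a0, weights, tau, gamma):
    """Threshold u = a0 + sum a_j T_j at tau and scale by gamma.

    Returns (integer coefficients, integer right-hand side) of
    sum (gamma * a_j) T_j >= gamma * (tau - a0), raising ValueError
    if gamma does not clear every fraction.
    """
    scaled = [Fraction(str(a)) * gamma for a in weights]
    rhs = (Fraction(str(tau)) - Fraction(str(a0))) * gamma
    if any(v.denominator != 1 for v in scaled + [rhs]):
        raise ValueError("gamma does not make all coefficients integral")
    return [int(v) for v in scaled], int(rhs)
```

Called with the weights of the sample problem, a threshold of 3 and a factor of 10, this returns the coefficients 40, -45, 40, 20, 55, 25, 30 and the right-hand side 20.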

The next step in the solution of the LPBI is to convert it to canonical form (see chapter 6). This form has no negative coefficients, and all coefficients are sorted according to magnitude. The coefficients of the canonical form are also shown in Fig. A-20.
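A common way to reach a form with no negative coefficients is to replace each variable that has a negative coefficient by its complement. The sketch below does this and then sorts by magnitude. It is an illustration only: chapter 6 holds the actual definition, and both the descending sort order and the function name are assumptions.

```python
def canonical_form(coeffs, rhs):
    """Complement variables with negative coefficients, then sort by magnitude.

    Substituting T_j = 1 - T'_j for a negative a_j turns a_j*T_j into
    |a_j|*T'_j - |a_j|, so |a_j| moves to the right-hand side.
    Returns (sorted coefficients, new rhs, original indices, complemented flags),
    with indices 1-based and coefficients in descending order (an assumption).
    """
    entries = []
    for j, a in enumerate(coeffs, start=1):
        if a < 0:
            rhs -= a                     # adds |a| to the right-hand side
            entries.append((-a, j, True))
        else:
            entries.append((a, j, False))
    entries.sort(key=lambda e: -e[0])    # stable: ties keep original order
    return ([e[0] for e in entries], rhs,
            [e[1] for e in entries], [e[2] for e in entries])
```

Applied to the integer inequality of the sample problem, only term 2 is complemented, and the right-hand side under this construction becomes 20 + 45 = 65.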

The branch-and-exclude algorithm described in chapter 6 gives 17 basic solutions to the canonical form. These are shown in Fig. A-21A.

The basic solutions are converted to canonical families of solutions and then transformed back to their original (non-canonical) form. The 17 non-canonical families of solutions are shown on Fig. A-21B. Each solution family represents a Boolean template of index terms which can be used for retrieving from an inverted file. The 1's are interpreted as the required presence of a term, the 0's indicate the required absence of a term and the 2's indicate indifference as to whether the term is present or absent. The 1's and 0's correspond to fixed variables, while the 2's correspond to free or arbitrary variables. For example, solution family 12 specifies the retrieval of all documents which have term 5 present and term 2 absent, and with indifference as to whether terms 1, 3, 4, 6, 7 are present or not. The complete BRS is given by the union of all solution families.
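Matching a document against such a template, and forming the union over templates, can be sketched as below. Function names and the two sample documents are hypothetical; the template for family 12 follows the description in the text (term 5 required, term 2 forbidden, the rest free).

```python
def matches(template, doc_terms):
    """True if a 0/1 document vector fits a template of 1 (required),
    0 (forbidden) and 2 (indifferent) entries."""
    return all(t == 2 or t == d for t, d in zip(template, doc_terms))

def retrieve(templates, documents):
    """Union of all families: keep documents that fit at least one template."""
    return [name for name, terms in documents.items()
            if any(matches(t, terms) for t in templates)]

# Solution family 12 from the text: term 5 present, term 2 absent, rest free.
FAMILY_12 = [2, 0, 2, 2, 1, 2, 2]
```

In a production search the same templates would be evaluated against an inverted file instead of scanning document vectors, but the acceptance rule is the same.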

A.6 Miscellaneous Results

Near the right margin of the page on Fig. A-21B are shown the variables MIN, BASE, MAX and SIZE, which pertain to each of the solution families listed near the left margin of Fig. A-21B. The variables MIN, BASE and MAX are related to the range of predicted utilities associated with each of the solution families. (See Section 6.73.) The following terminology is introduced to describe this relationship.

We are given the LPBI from the linear programming solution (A-2):

$$\sum_{j=1}^{n} a_j T_j \ge (\tau - a_0) \qquad \text{(A-4)}$$

We multiply this inequality by the appropriate constant $\gamma$, giving a new inequality (A-3) with integer coefficients:

$$\sum_{j=1}^{n} a_j^* T_j \ge (\tau^* - a_0^*) \qquad \text{(A-5)}$$

where $a_j^* = \gamma a_j$, $j = 0,1,2,\ldots,n$, and $\tau^* = \gamma\tau$.

(In the sample problem $\gamma = 10$, $\tau^* = 30$, $n = 7$ and $a_0^* = 10$ from (A-1) through (A-3).) Next, we solve this inequality for its M families of solutions, k = 1, 2, ..., M. (In the example problem, M = 17.)

Designate the set of fixed indices j associated with the k-th solution family as $S_{k1}$ and the set of free indices as $S_{k2}$. (For example, with k = 12, $S_{k1} = \{2,5\}$ and $S_{k2} = \{1,3,4,6,7\}$.) Now define for each family k the following:

$$\mathrm{BASE}(k) = \sum_{j \in S_{k1}} a_j^* T_j \qquad \text{(A-6)}$$

$$\mathrm{MAX}(k) = \max_{S_{k2}} \Big[ \sum_{j=1}^{n} a_j^* T_j \Big] \qquad \text{(A-7)}$$

and

$$\mathrm{MIN}(k) = \min_{S_{k2}} \Big[ \sum_{j=1}^{n} a_j^* T_j \Big] \qquad \text{(A-8)}$$

(For the sample problem, BASE(12) = 55, MAX(12) = 210 and MIN(12) = 55, as shown on Fig. A-21B.)

Quantities (A-6) through (A-8) can be related to the end points of the range of predicted utility u(k) for the k-th solution family of the original inequality (A-1) as follows:

$$\min u(k) = \min_{S_{k2}} \Big[ a_0 + \sum_{j=1}^{n} a_j T_j \Big] = \frac{1}{\gamma} \Big[ a_0^* + \min_{S_{k2}} \sum_{j=1}^{n} a_j^* T_j \Big] = \frac{1}{\gamma} \big[ a_0^* + \mathrm{MIN}(k) \big]; \qquad \text{(A-9)}$$

and

$$\max u(k) = \max_{S_{k2}} \Big[ a_0 + \sum_{j=1}^{n} a_j T_j \Big] = \frac{1}{\gamma} \big[ a_0^* + \mathrm{MAX}(k) \big]. \qquad \text{(A-10)}$$

For the sample problem, using (A-9) and (A-10) gives:

$$\min u(12) = \tfrac{1}{10} [10 + 55] = 6.5 \quad \text{and} \quad \max u(12) = \tfrac{1}{10} [10 + 210] = 22. \qquad \text{(A-11)}$$

Thus we have $6.5 \le u(12) \le 22$. In a similar manner ranges of predicted utility can be established for each of the solution families shown in Fig. A-21B by using (A-9), (A-10) and the given data.

BASE(k) is used as a preliminary result in the computation of MIN(k) and MAX(k). To illustrate this, consider

$$\mathrm{MAX}(k) = \max_{S_{k2}} \Big[ \sum_{j=1}^{n} a_j^* T_j \Big] = \sum_{j \in S_{k1}} a_j^* T_j + \max_{S_{k2}} \Big[ \sum_{j \in S_{k2}} a_j^* T_j \Big] = \mathrm{BASE}(k) + \max_{S_{k2}} \Big[ \sum_{j \in S_{k2}} a_j^* T_j \Big]. \qquad \text{(A-12)}$$

A similar result holds for MIN(k).
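The quantities BASE, MIN and MAX and the resulting utility range can be computed directly from a template. The sketch below follows the decomposition just described: the fixed terms contribute BASE, and each free term is then set to whichever value pushes the sum up (for MAX) or down (for MIN). Function names are hypothetical; the coefficients, the scaled constant 10 and the scale factor 10 are those of the sample problem.

```python
COEFFS = [40, -45, 40, 20, 55, 25, 30]   # integer coefficients of the sample LPBI
A0_STAR, GAMMA = 10, 10                  # scaled constant term and scale factor

def family_bounds(template, coeffs=COEFFS):
    """Return (BASE, MIN, MAX) for a template of 1/0 (fixed) and 2 (free) entries."""
    base = sum(a for a, t in zip(coeffs, template) if t == 1)
    free = [a for a, t in zip(coeffs, template) if t == 2]
    # A free term raises the sum only if its coefficient is positive,
    # and lowers it only if its coefficient is negative.
    max_sum = base + sum(a for a in free if a > 0)
    min_sum = base + sum(a for a in free if a < 0)
    return base, min_sum, max_sum

def utility_range(template):
    """End points of predicted utility for the family: (a0* + MIN)/gamma
    and (a0* + MAX)/gamma."""
    _, low, high = family_bounds(template)
    return (A0_STAR + low) / GAMMA, (A0_STAR + high) / GAMMA
```

For solution family 12 (term 5 required, term 2 forbidden, all others free) this reproduces BASE = 55, MIN = 55, MAX = 210 and the utility range 6.5 to 22 quoted in the text.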

The SIZE of a solution family is defined as the number of 1's in it. This variable is shown on Fig. A-21. Each 1 specifies the required presence of an index term in any document vector which would match the family (or template). Very roughly, the probability P of finding a document which matches a given template is given by (see section 8.42)

$$P(\text{match}) \approx p^{s} \qquad \text{(A-13)}$$

where p is the average probability that an index term will be used, and s is the SIZE of the family. The larger the SIZE of a solution family, the greater are the chances that no documents will be found which will match it.

Each solution family has the pleasant property that any document retrieved using it will not be retrieved by any other reduced solution family. This can be verified by noting that each solution family of Fig. A-21 differs from the others by at least one 1 being changed to 0 or vice versa.
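The disjointness property lends itself to a mechanical check. In the sketch below two templates are taken to be disjoint when some term is required by one and forbidden by the other, which is the condition described in the text. Function names are hypothetical, and apart from the family 12 pattern the templates used to exercise the code are invented.

```python
def size(template):
    """SIZE of a family: the number of required terms (entries equal to 1)."""
    return sum(1 for t in template if t == 1)

def disjoint(f, g):
    """True if no document can match both templates, i.e. some term is
    required (1) by one family and forbidden (0) by the other."""
    return any({a, b} == {0, 1} for a, b in zip(f, g))

def all_pairwise_disjoint(families):
    """True if every pair of templates in the list is disjoint."""
    return all(disjoint(f, g)
               for i, f in enumerate(families)
               for g in families[i + 1:])
```

Running `all_pairwise_disjoint` over the 17 templates of Fig. A-21B would confirm the stated property for the sample problem; those templates are not reproduced in this appendix text, so the check is left to the reader who has the figure.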
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DOCUMENT INDE/l TEI'i DOCUMENT 

JfliMBER COUNl’ UTILITY 

68N10674 10 07 

61 DUlOGllAPHlcS 

CONTAHINANTS 

KJCROWAVE Sf^ECTRA 

MOLECULAR STRUCTURE 

SPACfcCPAFT CABIN ATMOSPHERES 

6BN12280 07 01 

EIRE PREVEM[QM 

MISSILE SILOS 

OXYGEN 

CATEGORY 11 

68N12612 07 05 

CAPACITORS 

INSULATORS i 

SEMICONDUCTING FILMS 
CATEGORY 9 

68N152Q6 13 09 

AIRCRAFT SAFETY 
DiSPLAy DEVICES 
FIRE PREVENTION 
INTEGRATED CIRCUITS 
HICRCELECTRONICS 
ULTRAVIOLET RADIATION 
CATEGORY 8 

68N156P0 06 01 

ACCIDENT INweSHGATlON 
CABIN ATMOSPhERES 
OXYCCN BREATHING 

68N16903 11 05 

AIR 

GAS MIXTURES 
IGNiriQN 

IGNITION TEMPERATURE 
SPACECRAFT CONTAMINATION 
CATEGORV 19 

6SN17367 11 01 

CABIN ATMOSPHERES 

FIRES 

FLIGHT HAZARDS 
IGNITION 
OXYGEN 
CATEGORY 31 

68N17350 ' 16 01 

EMERGENCY LITE SUSTAINING SYSTEMS 
FIRE PREVENTION 
FLAKE prorogation 
HUMAN FACTORS ENGINEERING 
IGNITIUN TEMPERATURES 
SPACE ENVIRONMENT SIHULATIOH 
SPACECRAFT CABIN ATMOSPHERES 
SPONTANEOUS COMBUSTION 


CHEMICAL ANALYSIS 
INORGANIC COMPOUNDS 
MOLECULAR SPECTROSCOPY 
ORGANIC COMPOUNDS 
category 23 


HAZARDS 

nonflammable materials 
SAFETY DEVICES 


DETECTORS 

METAL OXIDE SEMICONDUCTORS 
THIN FILMS 


COMPUTER DESIGN 
FAILURE 

INFRARED OBTCCTCRS 
LOGIC CIRCUITS 

TEMPERATURE HEASURtHG iriSTRUHENTS 
HARMING SYSTEMS 


APOLLO SPACECRAFT 
FIRES 

CATEGORY 11 


ALTITUDE 

HVOKOGSN 

IGNITION LIMITS 

SPACECRAFT CABIN ATMOSPHERES 

TEMPERATUi?E DISTRIBUTION 


EXTRATERRESTRIAL RESOURCES 

FLAME PROPOGATION 

HEI lUH 

NITROGEN 

STORAGE 


ENVIRONhEHTAL TESTS 

FIREPROOFING 

HELMETS 

HUMAN FACTORS LABORATORIES 
MATERIALS TESTS 
SPACE SUITS 
SPECIFICATIONS 
CATEGORY 5 


FIGURE A-1

INPUT DATA FOR SAMPLE PROBLEM
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3X>CUI4E!n? INDEX TEllM DOGUKEi'IT 
KUNBCR CO'OiH UTILITY 

6Sfll?92*l 13 02 

BURKING RATE 

riREBRCOFiKG 

HAZARDS 

plastics 

SPACECRAFT CABINS 
SPACECRAFT CONTAHIHATION 
CATEGORY 5 

6BN187A4 28 04 

ACCIDENT INVESnCATION 
BURNS UN JURIES! 

CONFERENCES 

electrical faults 

FIRE control 
FIREPROOFING 
FREON 

GLASS FIBERS 

HUNAN FACTORS ENGINEERING 
OJCYGEN 

PROTECTIVE CLOTHING 
SPACE SUITS 

SPONTANEOUS COMBUSTTON 
PRESSURE CHANaSRS 

6SN187A5 13 01 

ACCJDENT IHVFSTlGATroM 

CONFERENCES 

HIGH pressure OXYGEH 

PRESSURE COAHBERS 

SPACECRAFT CABIN SIKULATDRS 

FLECTRICAL FAULTS ^ 


66N18746 14 03 

CABIN ATMOSPHERES 

ENERCtVCY LIFE SUSUlMiNG SYSTEKS 

FIRE EXTINGUISHERS 

HIGH PRESSURE OXYGEN 

nokflahhable materials 

SAFETY devices 
SURVIVAI 

68K1C747 12 03 

ACCIDENT PREVENTION 
COK‘=Efi£KCSS 
FIRS CONTROL 

kuhmn factors engineering 

PROTECTIVE CLOTHING 
SPACECRAFT CABIN SIMULATORS 

6BNIB750 12 01 

ACCIDENT PRcVC-NTION 
EHcRGCNCY life sustaining SYSTEMS 
FIRE EXTlNGUlSHcRS 

HUNAN FACrCRS ENGINEERING 

SAFETY DEVICES 
SPOMmI.EOUS cuhbustiov 


CONTAnI HANTS 
FLAMMABILITY 

outgassing 

SPACrCRAFT CABIN ATMOSPHERES 
SPACECRAFT CONSTRUCTION MATERIALS 
TOXICITY 


ACCIDENT PREVENTION 

CABIN ATMOSPHERES 

CONTROLLED ATMOSPHERES 

EMERGENCY LIFE SUSTAINING SYSTEKS 

FIRE EXTINGUISHERS 

FLAKKAbILITY 

GAS COMPOSITION 

HIGH PRESSURE OXYGEN 

nonflahhable materials 

PRESSURIZED CABINS 
SAFETY DEVICES 
SPACECRAFT CABIN SIMULATORS 
THERMAL INSULATION 
CATEGORY 5 


CHEMICAL ANALYSIS 
FIRES • 

HUMAN PATHOLOGY 
RESIDUES 

SPONTANEOUS COMBUSTION 
FLAKHABILITY 


i conferences 

FIRL CONTROL 
FIREPROOFING 

HUMAN FACTORS ENGINEERING 
PROTECTIVE clothing 
SPACECRAFT CABIN SIMULATORS 
CATEGORY 5 


CABIN atmospheres 

EHEPGENCY LIFE SUSTAINING SYSTEMS 

FIRE EXTINGUISHERS 

PRESSURIZED CABINS 

safety devices 

• CATEGORY 5 


CABIN ATMOSPHERES 
FIRE CONTROL 
GAS COMPOSITION 
PROTCCTiVE CL0THTf4G 
SPACECRAFT CABIN SIMULATORS 
CATEGORY S 


FIGURE A-2 

INPUT DATA FOR SAMPLE PROBLEM
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IX)CU/4aNT I1;DEX TEPJl DOOUXHNT 

NUMBER COUNT UTILITY 

68N1875L l<i 01 

C&&1N /^TKOSPH&ftES 

tueRGENCY LIFE SUSTAINING SYSTEHS 
FI REPROOF I MG 

K'JKAM FACTORS ENGiNEERING 
MDNFLANKABl E MATERIALS 
SAFETY DEVICES 
SPONTANEOUS COMBUSTION 

6BN20005 12 01 

CNVIRONKEMT SIhULATION 
FLAKS propagation 

flakhable gases 

HIGH PRESSURE OXYGEN 
IGNITION 

PROTECTIVE. CLOTKIHC 

6BN20058 12 01 

AIRCRAFT HA2AR0S 

CARBON TETRAFLUORIOE 

DlfLUORO COMPOUNDS 

FIRE FIGHTING 

METKANC 

PYROLYSIS 

CBN 208 70 10 01 

COMBUSTION 

FIRES 

HAZARDS 

PROTECTIVE CLOTHING 
SPACCCRAFT CfJVIRONKerrTS 

68N21752 11 01 

FIRE PREVE:!T!0N 
FLAKE PROPAGATION 

human factors laboratories 

NOHrLAKKABLE MATERIALS 
PROTECTIVE CLOTHING 
CATEGORY 5 

68NZA75C 15 01 

siBUOGPAPmes 

flakhadility 

HEAT TRANSFER 
hUMAN TOLERAKCCS 
SPACECRAFT CAOIN ATMOSPHERES 
SPACECRAFT CONTAHIHATION 
TOXIC HAZARDS 
CATEGORY 5 

6flfc2A871 10 01 

CONFERENCES 

FIRES 

IGNITION LIMITS 
SPONTANEOUS COMBUSTION 
UNITED STATES OF AMERICA 

68N29668 07 06 

AIRCRAFT SAFETY 

ELECTROLYTES 


CONFERENCES 
FIRE CONTROL 
GAS COMPOSITION 
MATERIALS TESTS 
PROTcCTIVE CLOTHING 
SPACECRAFT CABIN SIMULATORS 
CATEGORY 5 


FIPvE PREVENTION 
FLAMMABJUTY 
FLASH POINT 

HUMAN FACTORS LABORATORIES 
PRESSURE BISTRiaUTION 
CATEGORY 5 


8RDHIME COMPOUNDS 
CHLORINE FLUORIDES 
FIRE EXTINGUISHERS 
HALOGEN COMPOUNDS 
OXYGEN 
CATEGORY 6 


EXPLOSIONS 

FLAMMABILITY 

OXYGEN 

safety 

CATEGORY 


FIREPROOFINS 

rCftlUlAeiLITY 

MICE 

OXYGCri 

SPACECRAFT CABIN ATMOSPHERES 


FIRE EXTINGUISHERS 

flight crers 

HIGH PRESSURE OXYGEN 

LIFE SUPPORT SYSTEHS 

SPACCCRAFT CONSTRUCTION MATERIALS 

STATIC ELECTRICITY 

MEIGHTLESSNESS 


FIRE PREVENTION 
GREAT BRITAIN 

SPACECRAFT CABIN ATMOSPHERES 

THERAPY 

CATEGORY 5 


ELECTRDCHEPICAL CELLS 
FIRE PREVENTION 


FIGURE A-3 

INPUT DATA FOR SAMPLE PROBLEM



DOCUMENT INDEX TERM DOCUMENT
NUMBER COUNT UTILITY


TEMPERATURE SENSORS 
CATEGORY 14 

■68H29947 10 03 

calibrating 

CURRENT AMPLIFIERS 
INERTIA 

TEMPERATURE MEASURING INSTRUMENTS 
TRIQDES 

68N30134 07 01 

BURNING RATE 
FLAHHABILITY ' 

SPACECRAFT CABIN ATMOSPHERES 
CATEGORY 33 

6BH34881 U 08 

ATMOSPHERIC COMPOSITION 
ELECTRICAL PROPERTIES 
ORGANIC COMPOUNDS 
SEHICONDUCTIMG FILMS 
SPACECRAFT CONTAMINATION 
CATEGORY 5 

6SN36272 07 01 

AIRCRAFT FUEL SYSTEMS 

EXPLOSIONS 

IGNITION 

CATEGORY 2 

60N36274 12 01 

AIRCRAFT FUEL SYSTEMS 

COMMERCIAL AIRCEIAFT 

ELECTRIC DISCHARGES 

FUEL TANKS 

LIQUID NITROGEN 

VENTS 

60(136275 08 01 

AIRCRAFT FUEL SYSTEMS 
aircraft INDUSTRY 
FIRE PREVENTION 
SAFETY DEVICES 


WARNING SYSTEMS 


CORRECTION 
GAS FLOW 

SEMICONDUCTOR DEVICES 
TEMPERATURE SENSORS 
CATEGORY 14 


FIRE PREVENTION 

IGNITION TEMPERATURE 

SPACECRAFT CONSTRUCTION MATERIALS 


CLOSED ECOLOGICAL SYSTEMS 
gas ANALYSIS 
POLYMERIC FILMS 
SPACECRAFT CABIN ATMOSPHERES 
THIN FILMS 


CONFERENCES 
FIRE PREVENTION 

polyurethane foah 


CARBON- dioxide 
CONFERENCES 
FIRE prevention 

lightning 

SAFETY DEVICES 
CATEGORY 2 


AIRCRAFT HAZARDS 
CONFERENCES 
JET AIRCRAFT 
CATEGORY 2 


FIGURE A-4 

INPUT DATA FOR SAMPLE PROBLEM 



DOCUMENT DATA

NO. OF DOCUMENTS PROCESSED=28
CATEGORY THRESHOLD= 3
(DOCUMENTS WITH WEIGHTS GREATER THAN OR EQUAL TO THRESHOLD ARE IN CATEGORY 1)

PROGRAM    ACTUAL      DOCUMENT              NO. OF    NEW
DOC. NO.   DOC. NO.    WEIGHT     CATEGORY   TERMS     TERMS

    1      68N10674       7          1         10        10
    2      68N12280       1          0          7         7
    3      68N12312       5          1          7         7
    4      68N15206       9          1         13        12
    5      68N15620       1          0          6         5
    6      68N16903       5          1         11        10
    7      68N17367       1          0         11         7
    8      68N17380       1          0         16        13
    9      68N17925       2          0         13         7
   10      68N18744       4          1         28        16
   11      68N18745       1          0         13         2
   12      68N18746       3          1         14         1
   13      68N18747       3          1         12         0
   14      68N18750       1          0         12         0
   15      68N18751       1          0         14         0
   16      68N20005       1          0         12         5
   17      68N20058       1          0         12        10
   18      68N20870       1          0         10         5
   19      68N21752       1          0         11         1
   20      68N24756       1          0         15         7
   21      68N24871       1          0         10         3
   22      68N29668       6          1          7         3
   23      68N29947       3          1         10         7
   24      68N30134       1          0          7         0
   25      68N34881       8          1         11         5
   26      68N36272       1          0          7         3
   27      68N36274       1          0         12         7
   28      68N36275       1          0          8         2

FIGURE A-5

DOCUMENT DATA SUMMARY FOR SAMPLE PROBLEM



INDEX TERM DATA
ALPHABETICAL SORT

NO. OF TERMS DISCOVERED=155
SOURCE ENTROPY H(U)= 0.940

PROGRAM TERM NO.    INDEX TERM    INFORMATION STATISTIC

37 

ACCIDENT INVESTIGATION 

D«Q9D22 

79 

ACCIOEKT PREVENTION 

0.03441 

Z5 

AIRCRAFT SAFETY 

O.U340 

103 

AIRCRAFT HAZARDS 

0.04771 

144 

AIRCRAFT euEL SYSTEMS 

0.07337 

154 

AIRCRAFT INDUSTRY 

0.02329 

* 42 

AIR 

0.05479 

43 

ALTI rODE 

0.05479 

3d 

APOLLO SPACECRAFT 

D.D2329 

L39 

ATMOSPHERIC COMPOSITION 

0.05479 

I 

61BU0GKAPKIE5 

0.00474 

104 

BROMINE CQH<’QUNOS 

0.02329 

72 

BUkNlKG RATE 

0.04771 

BO 

BURNS I injuries; 

0.05479 

39 

CABIN ATMOSPHERES 

0.00526 

132 

CAL I fiR ATI NO 

0.05479 

16 

CAPACI TORS ‘ 

0.05479 

IQS 

CARBON TETP.AFLUORIOE 

0*02329 

147 

CARBON DIOXIDE 

0.02329 

10 

category 23 

0.05479 

17 

CATEGORY 11 

0.04771 

24 

CATEGORY 9 

0,05479 

36 

CATEGORY 3 

0.05479 

51 

CATEGORY 14 

0.17649 

58 

CATEGORY 31 • 

0.02329 

71 

category 5 

0.00669 

112 

LATcGGR/ 6 

0.0?329 

117 

CATEGORY 33 

0*0v77i 

146 

CATEGORY 2 

0.07337 

2 

CHEMICAL ANALYSIS 

0.004T4 

106 

CHLORINE FLUORIDES 

0.02329 

140 

CLOSED ECOLOGICAL SYSTEM 

0.05479 

113 

. COMBUSTION 

0-02329 

148 

COHMCRCIAL AIRCRAFT 

0.02329 

26 

COMPUTER DESIGN 

0.05479 

81 

conferences 

0.00085 

3 

CONTAMINANTS 

0.00474 

62 

CONTROLLED ATMOSPHERES 

0.0547? 

133 

CORRFCTIon 

0-05479 

134 

CURRENT amplifiers 

0.05479 

19 

DBTCCTORS 

0-05479 

107 

OIFLUORO CCHPOUrlOS 

0.02329 

27 

DISPLAY DEVICES 

0.05479 

83 

ELECTRICAL FAULTS 

0.00474 

129 

tLECTROCHCRICAL CELLS 

0.01479 

130 

ELECTROLYTES 

0-05479 

141 

ELECTRICAL PROPERTIES 

0.05479 

149 

ELECTRIC discharges 

0.02329 

59 

EHERGENCY LIFE SUSTAIMIN 

0.01698 

60 

environmental tests 

0.02329 


FIGURE A-6 

ALPHABETICAL LISTING OF INDEX TERMS IN SAMPLE PROBLEM TRAINING SET



INDEX TJSRM 

TER>: 

iKFoni-Aa'iou 

NUMBER 


STATISTIC 

93 

ENVJRONHCNT SlKULATION 

0.02329 

IK 

EXPLOSIONS 

0.09771 

52 

EKlftATfcRRESTRlAL RESOURC 

0.02329 

ZB 

FAILURE 

0.05979 

61 

FIREPROOFING 

0.00099 

11 

FIRE PREVENTION 

0.06593 

AD 

FIRES 

0.12897 

fiA 

FIRE CONTROL 

0.D3867 

83 

FIRE EXTINGUISHERS 

0.0X698 

]OB 

FIRE FIGHTING 

0.02329 

53 

PLANE PROROGATION 

0.09771 

09 

FLAKE PROPAGATION 

0.09771 

73 

FLAHHARILl TV 

0.D7586 

lOD 

FLAHKAbLE CASES 

0.02329 

101 

FLASH POINT 

0.02329 

SA 

FLIGHT hazards 

0.02329 

119 

FLIGHT CRFNS 

0.02329 

Ob 

FREON 

0.05A79 

150 

FUEL TANKS 

0.02329 

87 

CAS COMPOSITION 

0.00022 

K2 

GAS AN At VS IS 

0.05979 

135 

GAS FLOW 

0.0S479 

AA 

GAS MIXTURES 

0.05^79 

68 

GLASS FIBERS 

0.05979 

126 

GREAT BRITAIN 

0.02329 

109 

HALOGEN COMPOUNDS 

0.02329 

12 

HAZARDS 

0.07337 

120 

HEAT TRANSFER 

0.02329 

55 

HBUOH 

0.02529 

62 

HELMETS 

0.02329 

89 

HIGH PRESSURE OXYGEN 

O.UOlR^r 

, 63 

human factors ehciheerjn 

0.01693 

6A 

HUMAN FACTORS LABORATORI 

0.07337 

95 

HUMAN PATHOLOGY 

0.023P9 

121 

HUMAN TOLFRA«C*^S 

0*02329 

A5 

HYDROGEN 

0. 05^/9 

A6 

IGMITICN 

0.00630 

47 

IGNITION LIMITS 

O.OOi-74 

A6 

IGMIYICV temperature 

0.00^74 

65 

IGNITION TEMPEP.ATURES 

0.02329 

136 

INERTIA 

0.05479 

29 

INFRARED DETECTORS 

0.05479 

A 

INORGANIC COMPOUNDS 

0.05479 

•20 

INSULATORS 

0.05479 

30 

integrated CiRCUITS 

0.05479 

155 

JET AIRCRAFT 

0,02329 

122 

LIFE SUPPORT^SYSIEHS 

0.02329 

J51 

LIGHTNING 

0.02329 

152 

LIQUID NITPOGEM 

0.02329 

31 

LOGIC CIRCUITS 

0.05479 

66 

materials tests 

0.O47T1 

21 

METAL OXIDE S6KIC0N0UCTQ 

0.05479 

llO 

HETIUNE 

0.02329 

118 

MICE 

0.02329 

5 

MICROWAVE SPECTRA 

0.05479 

32 

KtCROELECTRONICS 

0.05479 

13 

hlSSILE SILOS 

0.02329 

6 

HOLSCULAR SPECTROSCOPY 

0,05^79 

7 

KOLECULAft STRUCTURE 

0-05479 

56 

NITROGEN 

0.02329 


FIGURE A-7 

ALPHABETICAL LISTING OF INDEX TERMS IN SAMPLE PROBLEM TRAINING SET



IDEX TERM 

aiDEX TEILM 

IHPORMATIOK 



STATISTIC 

14 

nonflammable materials 

0-00124 

8 

ORGANIC COMPOUNDS 

0.11343 

74 

OUTGASSlNG 

0.02329 

IS 

0X7GEN 

0.0341Z 

41 

0X7GEN BREATHING 

0.02329 

75 

PLASTICS ■ 

0.02329 

143 

POLYMERIC FILMS 

0.05479 

1'45 

POLYURETHANE FOAM 

0.02329 

90 

PRESSURIZED CABINS 

0. 1L340 

94 

PRESSURE CHAMBERS 

0.00474 

102 

PRESSURE DISTRIBUTION 

0.02329 

91 

PROTECTIVE CLOTHING 

0.00043 

111 

PYROLYSIS 

0.02329 

96 

RESIDUES 

0.02329 

15 

SAFETY DEVICES 

0.0D04D 

115 

SAFETY 

0.023Z9 

22 

SEMICONDUCTING FILMS 

0. 1L340 

137 

SEMICONDUCTOR DEVICES 

0.05479 

9 

SPACECRAFT CABIN ATIiOSPK 

0.00085 

49 

SPACECRAFT CONTAMINATION 

0.01032 

67 

SPACE ENVIRONMENT SIMULA 

0.02329 

68 

SPACE SUITS 

0,00474 

76 

SPACECRAFT CABINS 

0.02329 

77 

SPACECRAFT CONSTRUCTION 

■0.07337 

92 

SPACECRAFT CABIN SIMULAT 

0,01698 

116 

SPACECRAFT ENVIRONMENTS 

0.02329 

69 

SPECIFICATIONS 

0.02329 

70 

SPONTANEOUS COMBUSTION 

0=03412 

123 

STATIC ELECTRICITY 

0.02329 

57 

STORAGE 1 

0.02329 

97 

SURVIVAL 

0. OS 479 

33 

TEMPERATURE MEASURING IN 

0.11340 

50 

temperature DISTRIBUTION 

0.05479 

131 

TEMPERATURE SENSORS 

0.11340 

127 

THERAPY 

0-02329 

93 

thermal INSULATION 

0.05479 

23 

THIN FILMS ■ 

0.11340 

78 

TOXICITY 

0.02329 

124 

TOXIC hazards 

0.02329 

138 

TRIQDES 

0.05479 

34 

ULTRAVIOLET RADIATION 

0.05479 

126 

UNITED STATES OF AMERICA 

0.02329 

153 

VENTS 

0.02329 

. 35 

HARNING SYSTEMS 

0-11340 

125 

WEIGHTLESSNESS 

0.02329 


FIGURE A-8 

ALPHABETICAL LISTING OF INDEX TERMS IN SAMPLE PROBLEM TRAINING SET 
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INDEX TERM DATA 
INFO. SfAr, SORT 


HO. OF TERMS D!SCOVERED=155 
SOURCE ENTROPY HU)=« 0.9A0 


PROGRAM 

iKoex 


inpor'hatioh 

TERR WO. 

TERR 


STATISTIC 

SI 

CATEGO“.V 14 


0.17649 

40 

FIRES 


0.12897 

151 

temperature sewsors 


0*11349 

90 

pressurized CABirtS 


0.11340 

35 

WARNING SYSTEMS 


0.1134a 

33 

TEMPERATURE MEASURING IN 


0.11349 

25 

AIRCRAFT SAFETY 


0.11340 

23 

THIM FILMS 


0.11340 

22 

SEHICONOUCTIN6 FILMS 


0.U340 

a 

ORGANIC COMPOUNDS 


0.11349 

73 

FLARHABILITY 


0.07586 

146 

CATEGORY 2 


0.07337 

14A 

AIRCRAFT FUEL SYSTEMS 


0,07337 

77 

SPACECRAFT CONSTRUCTION 


0.07337 

64 

HUMAN FACTORS LADORATORl 


0.07337 

12 

HAZARDS 


0.07337 

11 

FIRE PREVENTION 


0.06593 

143 

POLYMERIC f’ILHS 


0.05479 

1X2 

GAS ANALYSIS 

• 

0.05479 

lAl 

ELECTRICAL PROPERTIES 


C. 05473 

140 

CLOSED ecological SYSTEM 


0.05473 

13^ 

AThOSPHSAlC COMPOSITION 


0.05479 

138 

TRIODES 


0.05479 

137 

SeHlCONDUCTOR DEVICES 


0,05477 

136 

INclUIA 


0.05479 

138 

GAS FLOW 


0.05479 

134 

CURRENT AHPLIFICRS 


O.OS4T9 

133 

CDKftcCiluH 


O- 9^479 

132 

CALIBRATING 


0.05479 

130 

ELECTROLYTES 


0.05479 

129 

ELECTROCllEillCAL CELLS 


0.65479 

97 

SURVIVAL 


0.05479 

93 

THERMAL INSULATIOil 


0.05479 

8S 

.GLASS fibers 


0.05479 

86 

FREON 


0. 05479 

82 

CONIROLLED ATItOSPMERGS 


0.05479 

80 

burns (INJURIES) 


0.05479 

50 

TEMPERATURE DISTRIBUTION 


0.05479 

45 

HYDROGEN 


0.05 4 TV 

99 

GAS MIXTURES 


0.05479 

93 

altitude 


0.05479 

42 

AIR 


0.05479 

36 

CATEGORY 0 


0.05479 

34 

ULTRAVIOLET RADIATION 


0.05479 

32 

MICROELECTRONICS 


0.054/9 

31 

LOGIC CIRCUITS 


0,05479 

30 

INTEGRATEO CIRCUITS 


0.05479 

29 

INFRARED DETECTORS 


0.05479 

28 

FAILURE 


0.05479 

27 

DISPLAY DEVICES 


0.05479 


FIGURE A-9 

INFORMATION STATISTIC SORT OF INDEX TERMS IN SAMPLE PROBLEM TRAINING SET


INDEX T£RI'i 
HUMBER 

IHD2a term 

IHFORXATION 

STATISTIC 

26 

COHPlJTfR OBSIGH 

0.05479 

24 

CATEGDRY 9 

0.05479 

21 

HFFAL 0X109 SEKICOHDUCTO 

0.05479 

20 

INSULATORS 

0.05479 

19 

DETECTORS 

0.05479 

la 

CAPACITORS 

0.05479 

10 

CATECORV 23 

0.05479 

r 

KOLECULAR STRUCTURE 

0. 05479 

6 

rtOLECULAR SPECTROSCOPY 

0.05479 

5 

HiCftOHAVE SPECTRA 

0.05479 
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SASIC SOLUTION SUMMARY 

IPROGRAM TERM NO.=-l FOR SLACKiO FOR CONSTAMl ,-2 FOR ARTIF.J 
(BASIC VAR. N0.=0 FOR ARTIF. VAR.) 


BASIC 

PROGRAM 

INDEX 


TERM 

BASIC 

VARIABLE 

VAR NO. 

TERM MO. 

TERM 


WEIGHT 

TYPE 

INF. STAT, 

23 

-1 

68N10674 

SLACK 

3.00000 

REG. 


i 

0 

CONST, 


1,00000 

REG. 

0. 

9 

23 

THIN FILMS 


4.00000 

REG. 

0.11340 

15 

i31 

TEMPERATURE 

SENSORS 

-4.5000D 

REG. 

0.11340 

2 A 

-1 

68H12280 

SLACK 

0. 

REG. 


2 

51 

CAIEGOP.Y 14 


4.00000 

REG. 

0.17649 

55 

-1 

68M1S620 

SLACK 

~0. . 

REG. 


3 

40 

FIRES 


0, 

REG. 

, 0. 12897 

31 

-1 

68N17925 
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REG. 


32 

~1 
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REG. 


57 

-1 

6aN17367 
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REG. 


3A 

~i 
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90 
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REG. 


70 

— 1 

68N24756 

SLACK , 

0. 
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— i 
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TEMPERATURE 
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FIGURE A-18 
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FIGURE A-19 
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